Highlights
What are the main findings?
- We present a comprehensive taxonomy of hybrid-architecture designs for integrating Visual Mamba with CNN/Transformer backbones, clarifying recurring integration paradigms and their architectural roles in remote sensing pipelines.
- We provide a unified taxonomy spanning hyperspectral analysis, multimodal fusion, dense perception, and restoration. We also synthesise evidence on when Mamba improves accuracy, when it primarily reduces resource costs, and when established CNN or Transformer designs remain sufficient.
What are the implications of the main findings?
- Remote sensing models should treat serialization paths, task regimes, and hardware constraints as first-class design variables when adopting Mamba, selecting scan-aware hybrid architectures instead of assuming state-space models are a universal upgrade.
- The community needs scan-aware benchmarks, transparent reporting of efficiency and numerical stability, and closer integration of physics-based priors with Mamba-style SSMs to build robust, reproducible, and practically deployable EO foundation models.
Abstract
Modern Earth observation combines high spatial resolution, wide swath, and dense temporal sampling, producing image grids and sequences far beyond the regime of standard vision benchmarks. Convolutional networks remain strong baselines but struggle to aggregate kilometre-scale context and long temporal dependencies without heavy tiling and downsampling, while Transformers incur quadratic costs in token count and often rely on aggressive patching or windowing. Recently proposed visual state-space models, typified by Mamba, offer linear-time sequence processing with selective recurrence and have therefore attracted rapid interest in remote sensing. This survey analyses how far that promise is realised in practice. We first review the theoretical substrates of state-space models and the role of scanning and serialization when mapping two- and three-dimensional EO data onto one-dimensional sequences. A taxonomy of scan paths and architectural hybrids is then developed, covering centre-focused and geometry-aware trajectories, CNN– and Transformer–Mamba backbones, and multimodal designs for hyperspectral, multisource fusion, segmentation, detection, restoration, and domain-specific scientific applications. Building on this evidence, we delineate the task regimes in which Mamba is empirically warranted—very long sequences, large tiles, or complex degradations—and those in which simpler operators or conventional attention remain competitive. Finally, we discuss green computing, numerical stability, and reproducibility, and outline directions for physics-informed state-space models and remote-sensing-specific foundation architectures. Overall, the survey argues that Mamba should be used as a targeted, scan-aware component in EO pipelines rather than a drop-in replacement for existing backbones, and aims to provide concrete design principles for future remote sensing research and operational practice.
1. Introduction
Modern optical and SAR constellations now image the Earth at metre- and sub-metre-scale spatial resolutions, tens to hundreds of spectral bands, and revisit periods of hours to days. Sentinel-2, for example, provides 10 m multispectral imagery with 13 spectral bands and a global revisit frequency of about five days, while commercial missions deliver sub-metre panchromatic and 3–5 m multispectral data. Taken together, this combination of high spatial resolution, high spectral dimensionality, and high temporal frequency yields archives in which individual scenes can reach gigapixel scale and time series can span thousands of acquisitions. Models must capture long-range dependencies across space, spectrum, and time without losing small structures or exceeding realistic memory and energy budgets.
Deep neural networks have replaced hand-crafted descriptors by learning hierarchical features directly from data [1,2]. In remote sensing, convolutional neural networks (CNNs) became the standard backbone for scene classification, semantic segmentation, and object extraction [3,4]. Multi-scale encoders, dilated convolutions, and spectral–spatial fusion networks mitigate the challenges posed by small objects, class imbalance, and label noise [5,6]. Even advanced CNNs capture long-range interactions only indirectly, through depth, large kernels, or dilation. As a result, modelling dependencies across full 4 k × 4 k tiles or long time series remains difficult under practical memory budgets.
Transformer-based architectures address this limitation by replacing purely local processing with global self-attention over token sequences [7,8]. When adapted to remote sensing, ViT- and Swin-style backbones improve performance on land-cover mapping, change detection, and multi-temporal analysis [9]. However, the quadratic cost of self-attention in the sequence length is difficult to reconcile with the combined demands of high spatial resolution, large scene extent, and dense temporal sampling. For a 4096 × 4096 image divided into 32 × 32 patches, N = (4096/32)² = 16,384 tokens, so a single attention map contains N² ≈ 2.7 × 10⁸ entries; storing this map alone in 32-bit floating point requires roughly 1.1 GB per layer. Practical models therefore rely on patch cropping, windowed attention, or aggressive downsampling [10,11]. These remedies improve tractability but fracture global context and can suppress precisely those small structures—river channels, roads, building outlines—that are most relevant in Earth observation.
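The memory figure above is easy to reproduce. The following back-of-the-envelope sketch assumes a 4096 × 4096 tile tokenized into 32 × 32 patches (an illustrative configuration consistent with the tile sizes discussed in this section):

```python
# Memory for one dense L x L self-attention map in fp32.
# Assumed configuration: 4096 x 4096 tile, 32 x 32 patches.
tile, patch = 4096, 32
n_tokens = (tile // patch) ** 2        # 128 * 128 = 16,384 tokens
attn_entries = n_tokens ** 2           # one full attention map
attn_bytes = attn_entries * 4          # 4 bytes per 32-bit float
print(f"{n_tokens} tokens, {attn_bytes / 1e9:.2f} GB per attention map")
```

Note that this counts a single map; multi-head attention and activation storage for backpropagation multiply the figure further.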
Structured state-space models (SSMs) offer a different route to long-range modeling. S4 showed that certain continuous-time linear dynamical systems can be implemented as convolutional kernels that model very long sequences with linear-time complexity and good numerical stability [12]. Mamba extends this line by introducing selective state-space models whose parameters depend on the current token [13,14,15]. Instead of fixed transition and input matrices, Mamba modulates them as functions of the feature vector, implementing content-aware scans that amplify informative regions and suppress background, while retaining O(L) time and memory complexity via parallel scan algorithms.
For Earth observation data, these properties matter at scale. Flattening a 4096 × 4096 tile into a per-pixel sequence yields L on the order of 10⁷ tokens, and even patch-wise tokenization produces sequences far longer than in typical natural-image benchmarks. Long spectral vectors and dense temporal stacks further increase L. Mamba-style architectures provide a way to process such sequences with linear complexity while conditioning state updates on scene content, an appealing feature for anisotropic geophysical structures and heterogeneous urban layouts. Serialising 2D and 3D EO data into 1D sequences raises design questions that do not arise in text. Scan paths, multi-directional traversal, and the way Mamba is hybridised with convolutions or attention jointly determine which spatial, spectral, and temporal neighbourhoods are captured.
Bao et al. recently presented the first dedicated survey of Vision Mamba techniques in remote sensing, organising roughly 120 studies by backbone design, scan strategy, and application task [16]. Complementary surveys of visual Mamba architectures in generic computer vision [17,18,19,20] and the broader Mamba-360 review of state-space models, which categorises foundational SSMs into gating, structural, and recurrent paradigms across modalities, further situate Mamba within the wider SSM landscape [21]. Building on these efforts, this review adopts an EO-centric perspective and deliberately narrows its focus to questions that are specific to, or particularly critical in, Earth observation. Figure 1 summarises the evolution from foundational SSMs to the Mamba family, together with representative vision and EO variants. We use this timeline as a reference point for the scan mechanisms, hybrid paradigms, and application taxonomies discussed next.
Figure 1.
Conceptual evolution of Mamba-style state-space models and their vision and Earth observation variants [12,13,14,15,22,23,24,25,26,27,28,29,30,31].
The remainder of this review is structured as follows. Section 2 revisits continuous- and discrete-time SSM formulations, links them to selective recurrence in visual Mamba backbones, and examines how different scanning mechanisms serialize two- and three-dimensional EO data. Section 3 turns to spectral analysis and multi-source fusion, asking in which regimes Mamba-based models empirically improve over strong CNN and Transformer baselines, and where they mostly replicate existing behaviour with added complexity. Section 4 then examines high-resolution visual perception—semantic segmentation, object detection, and change detection—from the standpoint of scan design and hybrid backbones rather than as a catalogue of architectures. Section 5 covers restoration and generative applications, including super-resolution, pan-sharpening, and spatiotemporal fusion, and highlights where long-range propagation is demonstrably useful and where classical models remain adequate. Section 6 synthesizes cross-cutting issues such as stability, numerical robustness, physical consistency, efficiency, and the emerging landscape of Mamba-based foundation models, and identifies open questions that, in our view, must be addressed before Mamba can be treated as a default choice in EO pipelines. Section 7 summarises when Mamba is warranted in Earth observation, which architectural choices have practical impact, and which research questions remain most pressing.
2. Theoretical Foundations and Architectural Evolution
2.1. From Linear State-Space Models to Visual Mamba
State-space models (SSMs) describe linear dynamical systems via a latent state coupled to the input–output relation and admit standard tools for stability, controllability, and frequency analysis. Recent SSM-based neural sequence models implement this formalism as parameterised recurrent updates in discrete time. These architectures are designed to propagate information with linear complexity in sequence length and can complement self-attention in regimes with long contexts or tight memory budgets. Here we focus on the linear time-invariant (LTI) formulation and its discretization, and relate selective state-space updates to the design of visual Mamba backbones.
2.1.1. LTI State-Space Systems, Discretization, and Selective Recurrence
A continuous-time linear time-invariant (LTI) state-space model describes an input–state–output system as:

h′(t) = A h(t) + B x(t),   y(t) = C h(t),

where x(t) ∈ ℝ is the input sequence, h(t) ∈ ℝ^N is the latent state, and y(t) ∈ ℝ is the output. The matrix A ∈ ℝ^{N×N} governs autonomous dynamics, B ∈ ℝ^{N×1} injects the input, and C ∈ ℝ^{1×N} maps the state to the output.
For stable A, the system admits a causal impulse response and therefore a convolutional representation: the kernel

K(t) = C e^{tA} B

encodes how information propagates over time, providing a principled mechanism for long-range dependency modeling [32,33,34,35,36,37,38,39].
To integrate SSMs into deep neural networks, the continuous dynamics are discretized with step size Δ, yielding the recurrence:

h_t = Ā h_{t−1} + B̄ x_t,   y_t = C h_t,

where the discretized parameters are given by the standard zero-order-hold (ZOH) form

Ā = exp(ΔA),   B̄ = (ΔA)^{−1} (exp(ΔA) − I) · ΔB.
Unrolling the discrete recurrence (assuming a zero initial state, h_{−1} = 0) yields an explicit causal convolution form:

y_t = Σ_{k=0}^{t} C Ā^k B̄ x_{t−k}.
Equivalently, an LTI SSM induces a global 1D convolution along the sequence,

y = x ∗ K̄,

where the kernel sequence K̄ is fully determined by the discretized parameters (Ā, B̄, C), namely

K̄ = (C B̄, C Ā B̄, …, C Ā^{L−1} B̄).
This recurrence–convolution duality clarifies that long-context mixing is achieved through state evolution (equivalently through an induced global kernel), rather than through explicit token–token interactions.
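The recurrence–convolution duality can be verified numerically. The sketch below uses a diagonal A (as in S4D- and Mamba-style parameterizations) so that the ZOH discretization is elementwise; the specific dimensions and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 32
a = -np.abs(rng.standard_normal(N)) - 0.1   # diagonal of A; negative -> stable
B = rng.standard_normal(N)
C = rng.standard_normal(N)
delta = 0.1                                  # discretization step

# Zero-order-hold discretization (elementwise for diagonal A).
A_bar = np.exp(delta * a)
B_bar = (A_bar - 1.0) / a * B

x = rng.standard_normal(L)

# (1) Recurrent form: h_t = A_bar * h_{t-1} + B_bar * x_t,  y_t = C . h_t
h = np.zeros(N)
y_rec = np.empty(L)
for t in range(L):
    h = A_bar * h + B_bar * x[t]
    y_rec[t] = C @ h

# (2) Convolutional form with induced kernel K_k = C . (A_bar^k * B_bar)
K = np.array([C @ (A_bar ** k * B_bar) for k in range(L)])
y_conv = np.array([sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(L)])

print(np.allclose(y_rec, y_conv))   # the two forms agree
```

The recurrent form runs in O(L) with O(N) state, while the convolutional form exposes the same computation as a global kernel, which is what S4-style training exploits.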
Mamba-style selective SSMs depart from classical SSM layers by making the discretization and input–output projections input-dependent. Specifically, token-conditioned parameters Δ_t, B_t, C_t are computed dynamically from the current input x_t:

Δ_t = softplus(W_Δ x_t),   B_t = W_B x_t,   C_t = W_C x_t.
These parameters modulate the effective state update rule:

h_t = Ā_t h_{t−1} + B̄_t x_t,   y_t = C_t h_t,

in which Ā_t and B̄_t are the discretized counterparts associated with Δ_t.
This selectivity enables the model to amplify informative tokens while attenuating uninformative regions. In EO terms, this allows the model to allocate computational capacity to salient structures without paying the full cost of dense global attention.
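A minimal scalar-input sketch of this selective update illustrates how the discretization becomes token-conditioned (the conditioning weights w_delta, w_B, w_C are hypothetical stand-ins for the learned projections; softplus keeps Δ_t positive):

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 4, 64                                 # state dim, sequence length
a = -np.abs(rng.standard_normal(N)) - 0.1    # shared diagonal A (stable)
w_delta = 0.5                                # hypothetical conditioning weights
w_B = rng.standard_normal(N) * 0.5
w_C = rng.standard_normal(N) * 0.5
x = rng.standard_normal(L)                   # scalar token sequence

softplus = lambda z: np.log1p(np.exp(z))

h = np.zeros(N)
y = np.empty(L)
for t in range(L):
    delta_t = softplus(w_delta * x[t])       # token-conditioned step size
    B_t = w_B * x[t]                         # input-dependent projections
    C_t = w_C * x[t]
    A_bar_t = np.exp(delta_t * a)            # per-token ZOH discretization
    B_bar_t = (A_bar_t - 1.0) / a * B_t
    h = A_bar_t * h + B_bar_t * x[t]         # selective state update
    y[t] = C_t @ h

print(y.shape)
```

Because Δ_t scales the effective decay exp(Δ_t a), a token can either refresh the state strongly (large Δ_t) or let history pass through nearly untouched (small Δ_t), which is the mechanism behind content-aware amplification and suppression.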
SSMs provide a distinct route to long-range modeling compared with both CNNs and Transformers. CNNs propagate context primarily through local kernels and typically require deeper stacking, larger kernels, or dilations to expand the receptive field, which can introduce resolution–context trade-offs in dense EO prediction. Transformers obtain global context via pairwise token interactions, leading to O(L²) computation and substantial key–value storage. In contrast, SSM-style mixing realizes global propagation through state evolution, typically scaling as O(L) in time with streaming, sequence-length-independent state memory (up to constant factors). This profile is well aligned with RS settings where L becomes large due to high-resolution tiling, long time series, or hyperspectral sequences, although practical speed also depends on implementation efficiency and the sequence-length regime (Section 6.5).
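To make the scaling contrast concrete, a rough per-layer estimate of mixing cost can be sketched as follows (deliberately simplified: attention counts only the QKᵀ and AV products, the SSM only its diagonal state updates; the constants and dimensions are illustrative):

```python
def attn_mixing_flops(L, d):
    # Dense self-attention: QK^T and AV each cost ~L^2 * d multiply-adds.
    return 2 * L * L * d

def ssm_mixing_flops(L, d, n_state):
    # Diagonal selective SSM: per token, ~d * n_state state-update work.
    return 2 * L * d * n_state

d, n_state = 64, 16
for L in (1_024, 16_384, 262_144):          # token counts typical of EO tiling
    ratio = attn_mixing_flops(L, d) / ssm_mixing_flops(L, d, n_state)
    print(f"L={L:>7}: attention/SSM mixing-cost ratio = {ratio:,.0f}x")
```

Under these assumptions the ratio grows as L / n_state, so the advantage is negligible for short sequences and dominant at EO-scale token counts, matching the regime-dependent picture developed later in the survey.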
2.1.2. Topological Mismatch Between 1D Priors and 2D EO Data
The formulation above assumes a one-dimensional sequence with a natural ordering, as in time series, audio, or text. In those settings, neighbouring tokens in index space are also neighbours in the underlying domain, and long-range dependencies correspond to genuinely large temporal or spatial gaps. Satellite and aerial images, by contrast, lie on a two-dimensional grid, and any flattening into a one-dimensional sequence inevitably distorts neighbourhood relations. This extends naturally to higher-dimensional tensors when spectral and temporal axes are included.
A row- or column-wise raster scan preserves locality along the fast axis but disrupts it along the slow axis. Tokens adjacent in the sequence may correspond to distant pixels, and diagonally adjacent pixels can be separated by many update steps. As the spatial extent grows, the mismatch between sequence distance and Euclidean distance increases, slowing down information propagation and inducing anisotropic artefacts in dense prediction maps on large images [40]. These effects are especially relevant in EO, where rivers, roads, coastlines, and urban blocks exhibit strong directional structure that should be propagated coherently.
Visual SSM architectures mitigate this mismatch by redesigning both the scan trajectory and the recurrent update. Some architectures abandon strict causality and run both forward and backward passes along the same path. Others introduce cross-shaped or diagonal scans so that each pixel can exchange information with neighbours along rows, columns, and diagonals within a small number of steps. A third line of work treats the image as a continuous 2D trajectory and designs serpentine or space-filling curves that shorten the average path between semantically related regions [22,23,41]. Together, these strategies narrow the gap between one-dimensional recurrence and two-dimensional geometry and provide more isotropic context for dense prediction and restoration.
In EO applications, the scan is therefore more than a numerical detail; it is a modelling assumption. Paths that align with expected physical anisotropies—for example along flight direction, river networks, or road grids—favour information flow along those structures, whereas near-isotropic scans are preferable when no dominant direction is known a priori. Many Mamba-based EO models implicitly encode such preferences through their scanning choices, even when these are presented as purely architectural variants.
Figure 2 summarizes the compute–accuracy trade space of long-range operators from a remote-sensing workload perspective. The key takeaway is not that Mamba universally outperforms attention, but that linear-time mixing makes long-context modeling feasible under RS sequence lengths that quickly render quadratic attention impractical. This is why scan design and hybridization, rather than a backbone swap alone, dominate real EO performance.
Figure 2.
Theoretical compute scaling and accuracy–cost trade-offs of CNN, Transformer, and Mamba long-range operators.
2.1.3. Visual Backbones and Hybridization
Once SSM layers had become competitive with attention in language modelling, they were rapidly imported into vision backbones. One class of designs replaces convolutional or Transformer blocks by visual SSM blocks at selected stages of the hierarchy, accumulating long-range context on high-resolution feature maps with linear-time complexity [24]. In parallel, lightweight variants adapt the state dimension, projection layers, and gating mechanisms so that SSM layers remain stable and efficient when stacked deeply on large images [42,43,44].
Pure SSM backbones are attractive from an efficiency standpoint but are not always optimal for EO tasks that demand both precise local texture modelling and long-range reasoning. Hybrid architectures therefore combine Mamba-style SSM blocks with convolutions or attention. A common pattern uses CNN layers in early stages to capture local edges, fine structures, and radiometric statistics, while SSM layers appear at mid and late stages to propagate information across large spatial extents or long temporal windows. Some models retain a small number of Transformer layers at the coarsest scales to explicitly model region-to-region and object-to-object relations on top of SSM features [24,42].
From a remote sensing perspective, this evolution turns visual SSMs into flexible building blocks rather than monolithic replacements. The same Mamba layer can augment a U-Net-like encoder–decoder, a Swin-style hierarchical Transformer, or a multi-branch fusion network, depending on spatial resolution, modality mix, and resource envelope [42,43,44]. In the remainder of this review, the interplay between structured linear dynamics, selective recurrence, and hybridization with convolutions or attention is a recurring theme.
2.2. Scanning Mechanisms in Remote Sensing
In visual SSMs, the scan specifies how a 2D EO grid is serialised into a 1D sequence for recurrent state updates. Because recurrence propagates information primarily along the chosen order, scan choices induce different adjacency graphs and thus different inductive biases. To complement qualitative discussions, we report geometry-based indices that can be computed directly from the traversal without running training: (i) the Euclidean step distance d_i between consecutive tokens; (ii) the mean step distance d̄, the maximum jump d_max, and the jump ratio ρ (the fraction of steps with d_i > 1), which quantify local continuity; and (iii) the diameter of the union adjacency graph after multi-pass fusion, which upper-bounds the number of recurrent steps required for information to propagate across the grid. Table 1 summarises scan families and instantiates these indices on a fixed 256 × 256 tile for a reproducible comparison.
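These continuity indices can be reproduced directly from a traversal. The sketch below computes them for a row-wise raster and a serpentine scan on a 256 × 256 grid, taking the jump ratio as the fraction of steps whose Euclidean length exceeds 1:

```python
import math

def scan_indices(path):
    """Mean step distance, max jump, and jump ratio of a scan path."""
    d = [math.dist(path[i], path[i + 1]) for i in range(len(path) - 1)]
    mean_step = sum(d) / len(d)
    max_jump = max(d)
    jump_ratio = sum(1 for s in d if s > 1) / len(d)
    return mean_step, max_jump, jump_ratio

H = W = 256
raster = [(r, c) for r in range(H) for c in range(W)]
serpentine = [(r, c if r % 2 == 0 else W - 1 - c)   # reverse odd rows
              for r in range(H) for c in range(W)]

print(scan_indices(raster))       # raster pays a long jump at each row end
print(scan_indices(serpentine))   # serpentine keeps every step at distance 1
```

The raster scan yields (d̄, d_max, ρ) ≈ (1.988, 255.002, 0.003891), because 255 of its 65,535 steps are row-wrap jumps of length √(255² + 1), whereas the serpentine scan gives exactly (1, 1, 0).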
Continuity-aware scanning has also been explored outside EO, for example in DiT-style Mamba diffusion models such as ZigMa, where zigzag paths enforce spatial continuity between neighbouring patches on the image grid [45].
2.2.1. Directional and Multi-Directional Scans
The simplest image serialization is a row-wise or column-wise raster scan. Tokens that are adjacent along the fast axis remain close in the sequence, but neighbours along the slow axis can be separated by W or H steps, so information travels slowly across large tiles. Bidirectional variants alleviate this bias by running the recurrence forward and backward along the same path, allowing past and future tokens to influence each position. Bi-MambaHSI applies such bidirectional scans along spatial and spectral axes of hyperspectral cubes so that each pixel aggregates context beyond what standard convolutions provide [25].
Omnidirectional schemes increase directional coverage. RS-Mamba designs an omnidirectional selective scan in which states are updated along horizontal, vertical, and diagonal paths, approximating isotropic receptive fields while keeping overall cost linear in the number of pixels [46]. Cross-scan designs such as VMamba limit the number of directions but route information along criss-cross paths. This reduces the diameter of the induced token graph to O(H + W), which benefits dense prediction on very large scenes [23].
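The diameter reduction can be checked on a small grid. The sketch below builds the union adjacency graph of a four-way cross-scan, keeping only unit-length scan steps (i.e., excluding row- and column-wrap jumps, the convention under which the fused diameter equals (H − 1) + (W − 1)), and measures the diameter by breadth-first search:

```python
from collections import deque

def cross_scan_diameter(H, W):
    """Diameter of the fused 4-way cross-scan graph (unit-length steps only)."""
    # Row scans (forward/backward) contribute horizontal edges and
    # column scans contribute vertical edges -> a 4-connected grid graph.
    nodes = [(r, c) for r in range(H) for c in range(W)]

    def neighbors(p):
        r, c = p
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if 0 <= r + dr < H and 0 <= c + dc < W:
                yield (r + dr, c + dc)

    def bfs_eccentricity(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            p = q.popleft()
            for nb in neighbors(p):
                if nb not in dist:
                    dist[nb] = dist[p] + 1
                    q.append(nb)
        return max(dist.values())

    return max(bfs_eccentricity(p) for p in nodes)

print(cross_scan_diameter(16, 16))   # (16-1) + (16-1) = 30
```

On a 256 × 256 tile the same construction gives 510 recurrent steps in the worst case, versus 65,535 for a single raster pass, which is the gap tabulated in Table 1.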
Spiral and centre-focused scans encode yet another prior. SpiralMamba starts from a central pixel and follows an outward spiral so that the prediction region and its immediate neighbourhood appear in a contiguous subsequence [47]. Related schemes place the target pixel near the sequence centre and use tokens on both sides as spatial context. These designs are well suited to tasks where labels are attached to specific pixels or small patches, because the most relevant neighbourhood is processed as a compact segment while long-range dependencies are propagated along the remainder of the sequence.
In all cases, scan design implicitly states where relevant context is expected—along a dominant axis, across several orientations, or around a centre—and the SSM inherits this inductive bias through its recurrence.
2.2.2. Data-Adaptive and Geometry-Aware Scans
In heterogeneous scenes, hand-crafted raster or cross patterns can be misaligned with object boundaries or acquisition geometry. Several recent models therefore learn traversal paths or receptive regions from data so that the scan better follows local structure. QuadMamba uses a quadtree partitioning guided by learned locality scores; tokens are grouped into blocks with strong internal correlation, and an omnidirectional window-shift strategy moves information between blocks while preserving spatial coherence [48]. FractalMamba++ serializes two-dimensional patches along Hilbert-like fractal curves, which preserve locality across scales and adapt to varying input resolutions without redesigning the scan [49]. DAMamba couples the scan order with learned masks that concentrate updates on roads, building edges, and other salient structures, whereas MDA-RSM reweights multi-directional paths to reflect dominant building orientations and symmetries in urban layouts [50,51].
Task geometry also motivates specialized scans. For road and building extraction, traversals that approximate centre lines or follow estimated curvatures allow elongated man-made structures to be covered by short token paths. In change detection, cross-temporal scans traverse bi-temporal feature pairs at multiple scales, highlighting subtle structural differences between acquisitions while retaining linear complexity in the number of tokens. AtrousMamba adopts an atrous-window strategy with adjustable dilation rates: by increasing dilation, the receptive field grows without adding heavy attention blocks, which is beneficial when large-scale context and fine details must be captured simultaneously [52]. These geometry-aware schemes reduce the discrepancy between Euclidean layout and one-dimensional ordering.
Table 1.
Typical scanning mechanisms in visual state-space models with fixed-size quantitative instantiation on a 256 × 256 tile.
| Scan Type | Passes | Dense Recurrent Steps | Mean Step Dist. / Max Jump / Jump Ratio | Diameter Upper Bound After Fusion | Representative Model |
|---|---|---|---|---|---|
| Raster (row-wise) | 1 | 65,535 | 1.988, 255.002, 0.003891 | 65,535 | Vim [22] |
| Bidirectional raster | 2 | 131,070 | 1.988, 255.002, 0.003891 | 65,535 | Bi-MambaHSI [25] |
| Serpentine/zigzag | 1 | 65,535 | 1, 1, 0 | 65,535 | ZigMa [45] |
| Spiral/centre-focused (continuous) | 1 | 65,535 | 1, 1, 0 | 65,535 | SpiralMamba [47] |
| Cross-scan (4-way) | 4 | 262,140 | 1, 1, 0 | 510 | VMamba [23] |
| Omnidirectional (6–8-way) | 6–8 | 393,210–524,280 | 1, 1, 0 | 255 | RS-Mamba [46] |
| Adaptive | dynamic | data-dependent | data-dependent | data-dependent | DAMamba [50] |
2.2.3. Transform-Domain and Irregular-Geometry Serialization
Scanning is not limited to regular grids. In several restoration models, SSM backbones are combined with Fourier or wavelet transforms: spatial tokens are serialized and processed by Mamba layers, while frequency-domain modules refine high-frequency details and impose priors on aliasing and noise [53,54]. FaRMamba makes this interaction explicit by restoring attenuated high-frequency components via multi-scale frequency blocks conditioned on Mamba features [55]. For LiDAR and other irregular point sets, token sequences follow acquisition trajectories or learned neighbourhood graphs instead of raster order, aligning better with local geometry in geometric–semantic fusion networks [56]. In satellite image time series, architectures such as SITSMamba process spatial features with CNN encoders and then apply Mamba along the temporal axis to model crop phenology and related dynamics under multi-task objectives [57]. In these cases, “scan” refers as much to traversal in time, spectrum, or graph space as to traversal across pixel grids.
Table 1 summarises the scan families reviewed above and provides traversal-defined quantitative indices (local continuity and fused-graph propagation bounds) instantiated on a fixed 256 × 256 tile for reproducible comparison. Definitions of these indices and a worked example are provided in Appendix A.1, while Figure 3 visualises representative scan trajectories for intuitive comparison.
Figure 3.
(a) Raster scan (baseline), (b) bidirectional scan, (c) four-way selective scan with four orthogonal directions, (d) zigzag scan forming continuous serpentine directions, (e) Hilbert-curve/fractal scan that preserves strong spatial locality across the grid, (f) window-based continuous scan with locally connected U-shaped trajectories, (g) omnidirectional spectral scan (OS-Scan) combining row-wise, column-wise and ±diagonal trajectories, and (h) dynamic adaptive scan (DAS) whose path follows a data-driven importance map and largely skips low-importance regions. Arrow colors distinguish different scan branches/directions (e.g., forward vs. reverse, or different orientations). Dashed segments indicate cross-row/cross-window transitions or auxiliary connections between local trajectories. Dots denote the starting/anchor positions of a scan within a direction or window. In (h), “×” marks skipped (low-importance) tokens, while the highlighted nodes/links indicate the visited (high-importance) tokens and their traversal order.
2.2.4. Empirical Guidelines and Design Trade-Offs
Comparative ablations show that scan patterns mainly differ in how they balance local continuity, directional coverage, and computational cost. In published studies, performance gaps are often modest and depend on the dataset, backbone, and scene structure. Zhu et al. compared one-directional, bidirectional, cross, and omnidirectional scans across multiple backbones and datasets. They found that raster-like or bidirectional scans are already strong baselines, whereas broader directional coverage yields consistent gains mainly for large scenes, pronounced anisotropy, or shallow networks [58]. For EO applications, three pragmatic considerations are particularly useful:
- Scene geometry and directional structure. Long, thin, or strongly oriented structures (rivers, roads, building blocks) benefit from cross or omnidirectional scans that shorten paths along their main axes.
- Spectral–spatial structure and coupling. Hyperspectral cubes and multi-source stacks call for scans that traverse spatial and spectral axes jointly or explicitly interleave them, rather than treating each band independently.
- Sequence length and computational budget. Raster and bidirectional scans incur the smallest overhead and are suitable for very long sequences; omnidirectional and adaptive scans increase constant factors through additional branches or routing modules, even though asymptotic complexity in L remains linear.
In Mamba-based EO models, the scan should therefore be treated as a task-dependent design choice tied to sensing geometry and resource constraints. Selecting an appropriate scan is often as influential as choosing the backbone itself, because it determines which neighbourhoods the SSM connects within a few recurrent steps and which structures the model can represent efficiently.
2.3. Architectural Hybrids and Design Patterns in EO
In current remote sensing systems, Mamba is rarely used as a standalone backbone. Visual SSM blocks are instead inserted into CNN or Transformer pipelines or attached as temporal and fusion modules around existing encoders. Architectures differ mainly in where Mamba is placed and how it interacts with local operators, rather than in the exact variant of the state space layer. This subsection summarises common integration patterns and their implications for dense prediction, multimodal fusion, and time series modelling. Figure 4 provides a compact schematic of the recurring integration topologies used in this section.
Figure 4.
Integration paradigms of Mamba-style state space blocks in Earth observation hybrids. Shaded blocks denote local operators such as convolution or local attention, and unshaded blocks denote state space blocks implemented by selective scan. Solid arrows indicate the main feature flow, whereas dashed arrows indicate skip routing, fusion, or cross-branch interaction. (a) Serial insertion or replacement. (b) Parallel branches. (c) Bifurcated designs.
2.3.1. CNN–Mamba Hybrids for Dense Prediction
A first family of designs follows the serial integration of Figure 4a, attaching Mamba blocks to convolutional encoders or decoders for dense prediction. Typical U-Net-style hybrids retain a CNN encoder to capture local textures and object boundaries and replace bottleneck or decoder stages with Mamba layers so that large-scale context is propagated with linear complexity [47,59,60,61,62]. For very-high-resolution segmentation and change detection, CNN–Mamba U-Net variants scale to ultra-large tiles by pairing shallow convolutional stems with deep Mamba decoders. Skip connections preserve fine detail while the decoder aggregates global context [47,63].
A second pattern matches Parallel branches in Figure 4b. The CNN stream emphasises high-frequency details and radiometric statistics, whereas a Mamba stream handles long-range dependencies. Their outputs are fused by interaction modules such as attention and gating that weight features according to local signal characteristics [59,60,61,62]. This division of labour is particularly effective for road networks, river systems, and settlement structures, where sharp boundaries and global connectivity are simultaneously important.
Across these variants, a consistent principle emerges. Most designs place convolutional layers near the input to exploit local stationarity and edge-detection priors. Mamba blocks are introduced deeper in the network (often at lower resolutions), where each token covers a larger region, and cross-region context becomes crucial [47,59,60,61,62,63].
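The serial and parallel patterns can be sketched with stand-in operators: a 3 × 3 mean filter as a toy local (convolutional) stream and a bidirectional exponential moving average over a raster scan as a toy linear-time global mixer. All names and operators here are illustrative simplifications, not components of any cited model:

```python
import numpy as np

def conv_stream(x):
    """Stand-in local operator: 3x3 mean filter (edge-padded)."""
    p = np.pad(x, 1, mode="edge")
    return sum(p[i:i + x.shape[0], j:j + x.shape[1]]
               for i in range(3) for j in range(3)) / 9.0

def scan_stream(x, alpha=0.9):
    """Toy linear-time global mixer: bidirectional EMA over a raster scan."""
    seq = x.ravel()
    fwd = np.empty_like(seq)
    bwd = np.empty_like(seq)
    acc = 0.0
    for i, v in enumerate(seq):                  # forward pass
        acc = alpha * acc + (1 - alpha) * v
        fwd[i] = acc
    acc = 0.0
    for i, v in enumerate(seq[::-1]):            # backward pass
        acc = alpha * acc + (1 - alpha) * v
        bwd[len(seq) - 1 - i] = acc
    return ((fwd + bwd) / 2).reshape(x.shape)

x = np.random.default_rng(0).standard_normal((32, 32))

# Serial integration (as in Figure 4a): local stem, then global scan block.
serial_out = scan_stream(conv_stream(x))

# Parallel integration (as in Figure 4b): gated fusion of the two streams.
gate = 1.0 / (1.0 + np.exp(-x))                  # sigmoid gate from input
parallel_out = gate * conv_stream(x) + (1 - gate) * scan_stream(x)

print(serial_out.shape, parallel_out.shape)
```

Replacing the toy EMA with a selective SSM block and the mean filter with learned convolutions recovers the topologies described above; the point of the sketch is only where the local and global operators sit relative to each other.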
2.3.2. Transformer–Mamba Architectures for Efficiency and Attention
A second family of hybrids couples Mamba with Transformer layers or attention modules. It often follows Serial integration in Figure 4a, motivated by the observation that quadratic attention becomes a bottleneck on dense token grids. A common strategy is to keep attention at a small number of coarse stages, while assigning Mamba blocks to high-resolution stages to reduce the dominant cost. Vision backbones that reserve a few attention layers for coarse feature maps and use SSM layers for long-range propagation on dense grids follow this strategy [24,64].
In hyperspectral analysis, several architectures replace parts of the Transformer encoder with Mamba blocks to handle long spectral–spatial sequences more efficiently [65,66]. Some designs use Transformer layers within each scale to model rich interactions among tokens, while Mamba experts handle cross-scale propagation in a mixture-of-experts fashion, allocating capacity according to local spectral or spatial patterns [64]. Others employ Mamba as a spectral or temporal branch in parallel to spatial Transformers, so that spectral signatures and spatial layouts are captured with different inductive biases and then fused at intermediate layers [65,66,67].
For multimodal fusion between optical and SAR images, hybrid backbones often combine CNN or Transformer encoders with Mamba modules that operate on modality-agnostic representations [67]. In these systems, attention is used sparingly to align features across modalities or regions, whereas Mamba layers act as efficient carriers of context across the full scene. Overall, Transformer–Mamba hybrids tend to concentrate attention at a few strategic locations and use Mamba as the long-range backbone.
2.3.3. Multimodal and Temporal Pipelines
A third pattern embeds Mamba inside multimodal or temporal pipelines rather than in purely spatial backbones. For satellite image time series, a common choice is a CNN-based spatial encoder followed by a Mamba temporal encoder that processes sequences of spatial features under multi-task objectives such as crop classification and phenology monitoring [57,68]. This arrangement leverages CNNs for local land-cover geometry and uses linear-time recurrence to handle long acquisition histories.
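The temporal encoder in such pipelines can be sketched as a plain discretized state-space recurrence. The fixed (A, B, C) and all shapes below are illustrative simplifications; selective models such as Mamba make these matrices input-dependent at every step:

```python
import numpy as np

def linear_ssm_scan(x, A, B, C):
    """Minimal discretized state-space recurrence h_t = A h_{t-1} + B x_t,
    y_t = C h_t, applied over a feature time series. A single fixed (A, B, C)
    is used for illustration only.
    """
    T = x.shape[0]
    d = A.shape[0]
    h = np.zeros(d)
    ys = np.empty((T, C.shape[0]))
    for t in range(T):  # one pass over the sequence: O(T) time, O(1) state
        h = A @ h + B @ x[t]
        ys[t] = C @ h
    return ys

# Toy "acquisition history": 60 time steps of 16-dim per-pixel CNN features.
rng = np.random.default_rng(1)
x = rng.standard_normal((60, 16))
A = 0.95 * np.eye(8)                 # slowly decaying memory of past acquisitions
B = rng.standard_normal((8, 16)) * 0.1
C = rng.standard_normal((4, 8))
y = linear_ssm_scan(x, A, B, C)
print(y.shape)  # (60, 4)
```

Because the state is updated in place, memory stays constant in sequence length, which is what makes long acquisition histories tractable compared with attention over all time steps.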
In multimodal fusion, Mamba layers often appear as cross-modal or cross-scale fusion blocks. For example, some networks adopt separate CNN encoders for hyperspectral and LiDAR or for optical and SAR images, and then use Mamba to aggregate features across modalities and resolutions before prediction [67,69,70]. Other designs place Mamba modules at intermediate stages of encoder–decoder networks to propagate information between spatial scales or acquisition times, improving robustness to missing bands and acquisition gaps [68,71,72]. Li et al. proposed Semi Mamba for semi-supervised multimodal remote sensing feature classification. It introduces a cross-modality fusion module that exchanges the state transition matrices across modalities and averages the input matrices to balance modality influence [73].
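The cross-modality exchange described for Semi Mamba can be illustrated with a minimal two-stream recurrence step. The exact update rule and all shapes here are assumptions made for exposition, not the published formulation:

```python
import numpy as np

def cross_modal_ssm_step(h1, h2, x1, x2, A1, A2, B1, B2):
    """One recurrence step of a cross-modality fusion scheme in the spirit of
    Semi Mamba [73]: state-transition matrices are exchanged between the two
    modalities and the input matrices are averaged to balance their influence.
    The concrete update below is an illustrative assumption.
    """
    B_avg = 0.5 * (B1 + B2)         # averaged input matrices
    h1_new = A2 @ h1 + B_avg @ x1   # modality 1 driven by modality 2's dynamics
    h2_new = A1 @ h2 + B_avg @ x2   # and vice versa
    return h1_new, h2_new

rng = np.random.default_rng(0)
d, n = 8, 4                         # state and input dimensions (arbitrary)
h1 = h2 = np.zeros(d)
A1, A2 = 0.9 * np.eye(d), 0.8 * np.eye(d)
B1, B2 = rng.standard_normal((d, n)), rng.standard_normal((d, n))
for _ in range(5):                  # scan a short token sequence
    x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
    h1, h2 = cross_modal_ssm_step(h1, h2, x1, x2, A1, A2, B1, B2)
```

Swapping the transition matrices lets each modality's state evolve under the other's dynamics, so cross-modal correlations are encoded in the recurrence itself rather than in a late concatenation.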
Bifurcated integration also arises in multitask pipelines where multiple dense outputs must remain mutually consistent. Shen et al. proposed RSMTMamba as a unified multitask framework that performs semantic segmentation, height estimation, and boundary detection in one network. Their design embeds a Mamba-based cross-task feature learning module to model both local and long-range cross-task relations, and uses SSM-integrated refinement decoders to aggregate context across stages [74].
A consistent rule emerges. Use convolutions or local attention at high resolutions where fine detail matters, and reserve Mamba layers for stages where tokens already represent larger receptive fields or long temporal windows.
2.3.4. Summary
Figure 4 highlights three integration topologies that recur across Earth observation models. Serial integration inserts Mamba blocks into a single trunk, often at the bottleneck or decoder, while keeping local operators near the input. Parallel branches run a local stream and a global stream concurrently and rely on an explicit fusion operator to reconcile their features. Bifurcated designs split a shared stem into task-specific or modality-specific branches and introduce cross-branch exchange before prediction. These routing choices largely determine the accuracy and efficiency trade-offs in practice. They also explain why many strong models treat Mamba as a context propagation component, rather than as a full replacement for convolutions or attention.
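These three routing topologies can be summarised schematically. The operators below are trivial placeholders standing in for convolutional, Mamba, and fusion blocks; only the routing structure is meaningful:

```python
# Minimal sketch of the three integration topologies of Figure 4, with
# stand-in operators (real models use CNN, attention, and Mamba blocks).
def local_op(x):   return x + 1        # placeholder for a local (CNN) stage
def global_op(x):  return x * 2        # placeholder for a Mamba context stage
def fuse(a, b):    return a + b        # placeholder gated/attentive fusion

def serial(x):
    # Figure 4a: one trunk, Mamba inserted after local operators.
    return global_op(local_op(x))

def parallel(x):
    # Figure 4b: concurrent local and global streams, explicit fusion.
    return fuse(local_op(x), global_op(x))

def bifurcated(x):
    # Shared stem split into branch-specific paths before prediction.
    stem = local_op(x)
    return local_op(stem), global_op(stem)
```

The sketch makes explicit why fusion operators only exist in the parallel and bifurcated cases: serial designs reconcile local and global features implicitly, by composition.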
3. Spectral Analysis
Spectral analysis tasks exploit the rich information contained in hyperspectral and multispectral measurements, often in combination with auxiliary modalities such as LiDAR or SAR. Mamba-based state-space models have been introduced here not as drop-in replacements for CNNs or Transformers but as linear-time backbones that can follow tailored scan paths, integrate prior knowledge, and remain deployable under real-world resource constraints. This section first discusses hyperspectral image (HSI) classification, then multi-source fusion, and finally unmixing, target detection, and anomaly detection.
3.1. Hyperspectral Image Classification
HSI classification assigns land-cover or material labels to pixels from data cubes with hundreds of contiguous bands, where subtle spectral variations and fine spatial structures jointly determine class boundaries. Early deep-learning approaches relied mainly on CNN backbones and shallow fusion of spectral and spatial cues, which limited their ability to model long-range spectral dependencies and cross-scene variability [5,75,76]. Subsequent Transformer-based HSI classifiers improved global receptive fields but inherited quadratic attention costs and often required aggressive patching or windowing, which is difficult to reconcile with very long spectral–spatial sequences [77,78,79,80,81]. Mamba-based visual SSMs offer a different trade-off by casting spectra, spatial neighbourhoods, or spectral–spatial tokens as 1D sequences processed with linear-time recurrence. Their empirical benefit, however, depends on the sequence construction and on how Mamba is embedded in the surrounding architecture [82,83,84].
3.1.1. Serialization and Selective Scanning
The first design axis is the scan path used to serialize spectral–spatial cubes. Naïve raster scans over H × W × B grids generate very long sequences and mix foreground and background in a way that slows information propagation and weakens inductive bias. A series of Mamba-based HSI classifiers therefore designs the scan itself as an explicit modelling choice rather than an implementation detail.
Centre-focused trajectories reorganise small windows so that spectrally and spatially informative samples around the prediction pixel form a compact contiguous subsequence. Spiral and centre-path scans place the central region early (or near the middle) in the sequence and propagate state updates in both directions. This improves robustness to label noise and mixed pixels on datasets such as Indian Pines and Pavia University [46,85,86]. Other works extend this idea to 3D spectral–spatial paths, where sequences interleave bands and spatial neighbours so that Mamba propagates information jointly across wavelength and space while retaining linear complexity [77,87,88,89,90].
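One simple way to realise such a centre-focused ordering is to sort patch coordinates by their distance to the prediction pixel. This sketch is a generic stand-in, not a reproduction of any specific published spiral or centre-path scan:

```python
def centre_focused_order(h, w):
    """Serialize an h x w patch so that pixels closest to the centre come
    first; a simple stand-in for centre-path/spiral scans.
    """
    cy, cx = (h - 1) / 2, (w - 1) / 2
    coords = [(r, c) for r in range(h) for c in range(w)]
    # Sort by Chebyshev (ring) distance to the centre; raster order breaks ties.
    return sorted(coords, key=lambda rc: (max(abs(rc[0] - cy), abs(rc[1] - cx)), rc))

order = centre_focused_order(5, 5)
print(order[0])    # (2, 2): the prediction pixel leads the sequence
print(order[1:9])  # its 8-neighbourhood follows as a contiguous block
```

Compared with a raster scan, in which the centre pixel appears mid-sequence surrounded by unrelated border pixels, this ordering keeps the most informative samples in a compact contiguous subsequence, which is the property the cited scans exploit.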
Beyond regular grids, several networks build sequences over superpixels or graphs. Superpixel-based and graph-based Mamba variants unfold tokens along region boundaries or k-NN graphs, which shortens effective sequence length and sharpens boundary representations in heterogeneous landscapes [91,92,93,94,95,96]. These structured variants consistently yield moderate but reliable gains over raster or cross-scan baselines on standard HSI benchmarks while keeping the recurrent kernel simple [97,98,99,100,101]. These studies indicate that scan design is a primary driver of HSI classification performance, not a minor implementation choice. Centre-focused, 3D, and region-adaptive paths should therefore be selected to match acquisition geometry and label structure [102,103,104,105,106].
3.1.2. Hybrid CNN–Mamba and Transformer–Mamba Architectures
A second design axis is how Mamba is integrated with convolutional and attention modules. Most competitive HSI classifiers now adopt hybrid backbones where convolutions extract local edges and textures, while Mamba layers capture long-range spectral–spatial dependencies. Representative CNN–Mamba architectures use simple convolutional encoders followed by Mamba blocks or interleaved CNN–Mamba stages to extend context without sacrificing translation equivariance or efficient downsampling [79,80,82,83,107]. Transformer–Mamba hybrids keep attention in selected spectral or channel dimensions, while SSM layers handle long-range spatial and cross-band interactions. This division reduces FLOPs and memory relative to pure Transformers without sacrificing accuracy on Houston, Pavia, and related benchmarks [78,84,88,89,90,108,109,110,111].
These results suggest a pragmatic view: in HSI classification, Mamba is most effective as a complement to convolutions or attention rather than a wholesale replacement. In particular, hybrid designs are attractive when high-resolution maps, long spectral profiles, and limited labels co-exist, while small-patch settings with abundant annotations may still favour lighter CNNs.
3.1.3. Frequency- and Morphology-Enhanced Modelling
A third line of work enriches Mamba with frequency-domain and morphological operators that encode prior knowledge about spectral smoothness and object shape. Wavelet- and Fourier-based architectures decompose HSI cubes into multi-scale frequency components before or within Mamba blocks, allowing the network to treat low-frequency background and high-frequency structures differently and to better reconstruct textured classes such as urban materials or vegetation mosaics [112,113,114]. Morphology-aware designs integrate dilation, erosion, and related operators into token generation, so that shape information is already embedded when sequences are fed into the SSM [95,115,116]. These frequency- and morphology-enhanced models typically provide incremental but consistent improvements over plain CNN–Mamba baselines, especially on noisy scenes or when object boundaries are poorly aligned with pixel grids [99,100,101,114].
3.1.4. Efficient, Few-Shot, and Transferable Learning
In practice, hyperspectral missions are usually constrained by scarce labels, domain shift between sensors, and tight memory and energy budgets on operational platforms. Under these constraints, recent work has begun to explore self-supervised, few-shot, and lightweight Mamba variants. Self-supervised methods embed composite-scanning Mamba blocks inside masked-reconstruction or contrastive frameworks, using large unlabeled HSI archives to pretrain representations that transfer well to downstream classification on Houston, Pavia, and WHU-Hi datasets [99,100,101,117,118]. Few-shot extensions introduce metric-learning heads, dynamic token augmentation, or mixture-of-experts routing and consistently report higher accuracy than supervised Mamba baselines under five-shot settings [86,102,118].
To improve efficiency, several works reduce depth and width, apply re-parameterisation, or use structured pruning in Mamba backbones. Many of these choices are informed by MambaOut-style analyses originally developed for classification models [84,96,104,106,107]. Spectral Mamba-style families further reduce spectral redundancy by sharing parameters across bands or compressing spectral channels, cutting FLOPs and parameter counts by factors of three to ten while maintaining comparable accuracy on datasets such as Houston2013 [105,108,109,110,111,112]. These results indicate that Mamba can be made competitive for onboard or embedded HSI processing when architectures are explicitly tuned for parameter and energy efficiency.
3.1.5. Summary
Across these strands, Mamba-based HSI classifiers have evolved from generic visual SSM backbones into task-aware architectures whose performance hinges on scan design, hybridisation with convolutions and attention, and efficiency-oriented training. First, careful design of centre-focused, 3D spectral–spatial, and region-adaptive scans is the main mechanism by which Mamba outperforms raster baselines on real HSI benchmarks, and there is little benefit in using state-space models with naïve sequence orderings. Second, hybrid CNN– and Transformer–Mamba backbones typically provide a better accuracy–efficiency trade-off than either pure CNNs or pure Transformers, especially for large scenes with limited labels. Third, frequency- and morphology-enhanced modules and self-supervised, few-shot, or lightweight variants add robustness in noisy or resource-constrained regimes but also increase design complexity and thus should be reserved for settings where their benefits have been empirically demonstrated.
3.2. Multi-Source Fusion
Many remote-sensing applications combine hyperspectral, multispectral, LiDAR, DSM, and SAR data rather than relying on a single sensor. Differences in imaging physics, spatial resolution, and coverage make fusion non-trivial, especially at large scale where long-range dependencies and misregistration must be handled under strict computational budgets. Transformer-based fusion networks alleviate some of these difficulties, but their quadratic-complexity attention struggles at very high resolution or for long sequences.
Visual state-space models provide an alternative backbone with linear-time sequence modelling and flexible scan strategies. In multi-source settings, Mamba is used not only to propagate long-range context but also to encode cross-modal couplings directly within the state update. Current work can be grouped into three paradigms: cross-state interaction in heterogeneous classification, hybrid backbones with geometric or semantic priors, and frequency-aware or lightweight fusion, extended towards generative and reconstruction tasks.
3.2.1. Heterogeneous Modality Classification (HSI + LiDAR/DSM)
Joint classification of HSI with LiDAR or DSM requires exploiting both spectral signatures and height or geometric cues. A first paradigm replaces late feature concatenation with cross-state interaction: hidden states from different modalities interact during scanning so that cross-modal correlations are encoded in the recurrence.
CSFMamba implements a Cross-State Fusion module that couples convolutional feature extraction with Mamba-based global context modelling for HSI–LiDAR fusion [119]. On MUUFL and Houston2018, it achieves overall accuracies several percentage points higher than CoupledCNN late-fusion baselines, and ablations confirm that removing the cross-state module leads to substantial performance drops. S2CrossMamba extends this idea with an inverted-bottleneck Cross-Mamba design that updates multimodal states dynamically, reaching overall accuracies around 96% on MUUFL and clearly outperforming Transformer-based fusion backbones on both MUUFL and Augsburg [75]. MSFMamba combines multi-scale spatial and spectral Mamba blocks with dedicated fusion modules for HSI–LiDAR or HSI–SAR [120,121]. Related methods (e.g., CMFNet, TBi-Mamba, Mb-CMIFSD, and M2FMNet) refine cross-modal interactions via redundancy-aware fusion, triple bidirectional scanning, prototype-constrained self-distillation, or elevation-enhanced Mamba blocks [122,123,124,125,126].
A second paradigm overlays geometric or semantic priors onto Mamba backbones. DAHGMN couples graph convolutional networks with Mamba through hybrid GCN–Mamba blocks and dual-feature attention, where the GCN captures local geometric relationships from LiDAR, and Mamba supplies long-range context [127]. Removing either branch degrades performance, indicating that local graph structure and global state-space dynamics are complementary. Other fusion networks integrate CLIP-guided semantics, tri-branch encoders, or edge-aware priors with Mamba to emphasise semantic structure and contours in complex urban scenes [128,129,130].
A third line focuses on frequency-aware and lightweight designs. LW-FVMamba combines skip-scanning Mamba backbones with frequency-domain channel learners to align multimodal features in the spectral frequency domain [131]. On Houston, it uses roughly 0.12 million parameters and around 40 million FLOPs—significantly lower than ExViT, NCGLF2, or standard VMamba—while slightly improving overall accuracy. TFFNet integrates fuzzy logic with Fourier and wavelet transform fusion to handle both uncertainty and spectral–spatial details in misregistered HSI–LiDAR pairs [132].
These three paradigms—cross-state interaction, GCN- or semantics-augmented hybrids, and frequency-aware lightweight designs—form a compact taxonomy for HSI–LiDAR/DSM classification.
3.2.2. Generative Fusion and Reconstruction
Beyond classification, Mamba backbones are also used for generative fusion and reconstruction, including HSI–MSI fusion, pan-sharpening, and spatiotemporal super-resolution. Here the goal is to reconstruct high-quality images from complementary observations while preserving spectral fidelity and fine spatial detail under ill-posed conditions and imperfect registration.
HSI–MSI fusion networks such as FusionMamba and SSCM extend the standard Mamba block into dual-input or cross-Mamba variants that jointly process HSI and MSI streams in the state update [133,134]. Long-range spatial–spectral dependencies are modelled by Mamba layers, while Laplacian or wavelet modules emphasise high-frequency texture. SSRFN decouples spectral correction from spatial enhancement: a CNN-based spectral module first compensates upsampling errors, and a Mamba branch then injects global context into the corrected features [135]. S2CMamba tackles pan-sharpening with dual-branch priors that jointly enforce spatial sharpness and spectral fidelity [136]. MCIFNet instead uses a Mamba backbone to generate latent codes and an implicit decoder that maps coordinates and codes to pixel values, allowing reconstruction at arbitrary resolutions [137].
Registration is sometimes integrated into the fusion process itself. PRFCoAM alternates between a modal-unified local-aware registration module and an interactive attention–Mamba fusion module, mitigating error accumulation that typically occurs in two-stage pipelines [138]. SINet couples Mamba with a multiscale invertible neural network based on Haar wavelets and regularises forward and inverse transforms to limit information loss during fusion [139]. For spatiotemporal fusion, MambaSTFM and STFMamba use visual state-space encoders to process long sequences, paired with task-specific decoders. Expert modules handle spatial alignment or temporal prediction, enabling dense time-series reconstruction with linear-time encoders and modest decoder overhead [68,140].
3.2.3. Summary
Multi-source fusion showcases how Mamba can be specialized for remote sensing. Cross-state fusion replaces patch-wise concatenation with recurrent interaction across modalities; hybrid GCN–Mamba or semantics-guided designs inject geometric and task priors; and frequency-aware, skip-scanning variants demonstrate that high accuracy is compatible with strict parameter and FLOPs budgets. In generative and reconstruction tasks, dual-input Mamba blocks, registration–fusion coupling, and invertible or implicit decoders adapt state-space modelling to ill-posed inverse problems.
3.3. Hyperspectral Unmixing, Target and Anomaly Detection
Hyperspectral imagery provides dense spectral sampling, yet individual pixels typically contain mixtures of several materials embedded in structured backgrounds. Unmixing, target detection, and anomaly detection therefore need to exploit spectral correlations together with spatial context and background statistics. Classical linear-mixing and subspace models are appealing for their physical interpretability but become inaccurate in the presence of nonlinear interactions and complex clutter, while attention-based deep networks improve flexibility at the cost of substantial computation on full images. Recent work introduces Mamba-style state-space models that serialise spectra, spatial neighbourhoods, or pixel trajectories and use tailored scan schemes to balance global-context modelling with computational efficiency.
3.3.1. Hyperspectral Unmixing
Hyperspectral unmixing estimates endmember spectra and their abundances from mixed pixels, a problem that becomes increasingly ill-posed in the presence of noise, nonlinear mixing, and limited supervision. Classical geometrical and statistical approaches provide physically grounded solutions but face difficulties with nonlinear effects and large scenes [141]. Mamba-based networks address these issues by combining local structure, long-range spectral–spatial context, and recurrent state updates.
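The classical starting point is the linear mixing model, in which each pixel spectrum is a non-negative, sum-to-one combination of endmember spectra. The toy inversion below uses synthetic endmembers and an unconstrained least-squares solve purely for illustration:

```python
import numpy as np

# Linear mixing model: y = E @ a + noise, where E holds endmember spectra
# column-wise and a holds abundances (non-negative, summing to one).
# All data here are synthetic and illustrative.
rng = np.random.default_rng(2)
bands, endmembers = 50, 3
E = np.abs(rng.standard_normal((bands, endmembers)))  # endmember matrix
a_true = np.array([0.6, 0.3, 0.1])                    # abundances, sum to 1
y = E @ a_true + 0.001 * rng.standard_normal(bands)   # mixed pixel + noise

# Unconstrained least-squares inversion; practical unmixers additionally
# enforce non-negativity and the sum-to-one constraint.
a_hat, *_ = np.linalg.lstsq(E, y, rcond=None)
print(np.round(a_hat, 2))  # close to [0.6, 0.3, 0.1]
```

The ill-posedness discussed above enters when endmembers are unknown, mixing is nonlinear, or noise is strong, which is where the learned Mamba-based models below depart from this closed-form picture.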
MBUNet is a representative dual-stream design in which spatial and spectral features are extracted in parallel [142]. Convolutional layers capture local spatial patterns, while a bidirectional Mamba module aggregates global information along spectral–spatial dimensions. On Samson, Jasper Ridge, and Urban, MBUNet reduces both mean spectral angle distance and RMSE relative to a Transformer baseline DeepTrans and a pure Mamba model UNMamba [143]. This suggests that combining convolutions with bidirectional Mamba scanning is important for accurate abundance estimation.
Progressive sequence models such as ProMU treat unmixing as a sequence prediction problem over pixels or regions [144]. Stage-aware Mamba modules and progressive context selection refine abundances step by step. On Urban, ProMU reaches abundance RMSE comparable to image-level Mamba baselines while requiring roughly an order of magnitude fewer FLOPs than pixel-level Transformer models. Similar ideas appear in Mamba-SSFN and DGMNet, where Mamba branches are coupled with multi-scale convolutions or graph convolution to capture non-Euclidean spatial relationships and to improve scalability on large scenes [145,146].
3.3.2. Hyperspectral Target Detection
In hyperspectral data, target detection aims to identify pixels belonging to specified materials or objects within complex, structured backgrounds. Effective detectors must remain sensitive to small targets while being robust to background variability and spectral perturbations.
HTMNet adopts a two-branch hybrid, with a Transformer stream for global multi-scale features and a LocalMamba stream with circular scanning that gathers local context around potential targets [147]. A feature interaction fusion module combines the outputs so that both global background structure and fine-scale neighbourhood cues influence detection decisions. Across San Diego I/II, Abu-airport-2, and low-contrast Salinas scenes, HTMNet reaches near-saturated AUC and slightly exceeds both a pure Mamba detector (HTD-Mamba) and a Transformer baseline (TSTTD). This indicates that pairing local Mamba recurrence with Transformer-scale context helps in cluttered backgrounds.
HTD-Mamba approaches target detection from a self-supervised perspective [148]. A pyramid Mamba backbone and spatial-encoded spectral enhancement modules generate multiple spectral views for contrastive training, encouraging representations that are stable under spectral variations and effective at modelling background structure. Experiments indicate that such pretraining improves robustness in low-signal or few-label regimes, complementing hybrid architectures like HTMNet.
3.3.3. Hyperspectral Anomaly Detection
In hyperspectral imagery, anomaly detection is concerned with pixels whose spectra deviate from an estimated background model, typically in the absence of an explicit target signature. Performance depends critically on how accurately the background is modelled and reconstructed.
DPMN introduces a deep-prior Mamba network that uses a bidirectional Mamba-based abundance generation module to obtain background representations, coupled with a learnable background dictionary that partitions the background into several subspaces [149]. A regularization term combining total-variation and low-rank constraints enforces spatial smoothness and compactness, making it easier to separate anomalies from structured clutter. MMR-HAD adopts a reconstruction-based strategy with a multiscale Mamba reconstruction network, random masking to reduce the influence of anomalies on background estimation, dilated-attention enhancement, and dynamic feature fusion [150]. On standard anomaly benchmarks, both methods report improved detection accuracy, particularly when anomalies are subtle or densely distributed, compared with RX-type and CNN-based approaches.
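For reference, the RX-type baseline mentioned above scores each pixel by its Mahalanobis distance to global background statistics; the data below are synthetic:

```python
import numpy as np

def rx_scores(pixels):
    """Classical (global) RX anomaly detector: Mahalanobis distance of each
    pixel spectrum to the background mean and covariance. Included as the
    baseline against which Mamba-based reconstructions are compared.
    """
    mu = pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False)
    inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))  # regularised inverse
    diff = pixels - mu
    return np.einsum("ij,jk,ik->i", diff, inv, diff)

rng = np.random.default_rng(3)
scene = rng.standard_normal((500, 20))  # 500 pixels, 20 synthetic bands
scene[0] += 6.0                         # one spectrally deviant pixel
scores = rx_scores(scene)
print(scores.argmax())  # 0: the injected anomaly gets the highest score
```

Because RX assumes a single global Gaussian background, it degrades on structured clutter, which motivates the learned background dictionaries and masked reconstructions in DPMN and MMR-HAD.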
3.3.4. Summary
Across unmixing, target detection, and anomaly detection, Mamba is rarely used in isolation. Dual-stream unmixing networks rely on Mamba to propagate information along spectral and spatial dimensions while preserving explicit endmember modelling; progressive sequence models trade a small loss in accuracy for substantial reductions in computational cost. Hybrid target detectors combine Transformer-scale global context with local Mamba scans and benefit from self-supervised pretraining, while anomaly detectors use Mamba-based reconstructions as flexible background models combined with dictionaries, masking strategies, and regularizers.
4. General Visual Perception
High-resolution semantic segmentation, object and change detection, and scene classification underpin many operational Earth observation products. Models must process gigapixel scenes, capture long-range spatial dependencies, and scale across archives and sensors. Mamba backbones provide linear-complexity context modeling and are increasingly inserted into CNN or hybrid networks as drop-in replacements for attention or as dedicated long-range branches.
4.1. Semantic Segmentation
Semantic segmentation assigns a land-cover class to every pixel and is therefore a stringent test for models that must combine fine boundaries with kilometre-scale context. Classical CNN and encoder–decoder architectures, including U-Net–style and pyramid pooling variants, have built strong baselines for EO mapping but struggle to aggregate information over very large tiles without resorting to aggressive downsampling or tiling [47,60,151,152,153,154,155,156]. Transformer-based segmentors extend the receptive field but are often memory-bound on high-resolution aerial and satellite images, which limits the spatial extent or batch size that can be processed in practice [157,158]. Mamba-based segmentation models attempt to retain CNN-like efficiency while adding linear-complexity propagation of long-range cues, and recent work has converged on a small number of design patterns rather than isolated architectures [63,159,160].
4.1.1. Global–Local and Multiscale Architectures
Most Mamba segmentation networks adopt global–local hybrids in which convolutions handle local texture and boundary details, while Mamba branches transport information across downsampled feature maps. Samba and MF-Mamba attach Mamba encoders to CNN feature pyramids, so the SSM state evolves over multi-scale semantic maps rather than raw pixels [63,159,161]. This improves mIoU on Potsdam, Vaihingen, and LoveDA with only modest parameter overhead relative to CNN baselines. PPMamba wraps pyramid pooling modules with Mamba blocks and uses global–local state updates to refine predictions for large buildings and roads without oversmoothing small structures [160,162,163]. FMLSNet and related designs extend this idea by coupling ResNet-style encoders with lightweight Mamba layers, focusing on long-range refinement of large objects while leaving edge sharpening to convolutional decoders [164,165,166,167].
Other works target data efficiency and adaptation. LMVMamba and related models insert lightweight Mamba branches into multi-scale CNN encoders and share parameters across levels so that multi-scale features can be projected into a common linear dimension, which eases training on small labelled sets [153,154,168,169,170]. Multi-scale feature aggregation combined with state-space propagation has been shown to reduce fragmentation in large objects and to stabilise predictions under distribution shifts between cities or acquisition conditions [158,171,172,173,174].
Taken together, these results indicate that Mamba primarily serves as a global context carrier sitting on top of otherwise conventional segmentation stacks. Pure SSM encoders without convolutions remain rare and, on current benchmarks, offer limited evidence of clear benefits over carefully tuned CNNs and CNN–Transformer hybrids.
4.1.2. Spectral–Channel, Multimodal, and Generative Designs
A second line of research applies Mamba beyond purely spatial modelling, operating along spectral channels, across modalities, or inside generative decoders. Spectral–channel networks such as CPSSNet treat channels as ordered sequences and insert Mamba along the channel dimension, which improves discrimination of classes whose signatures differ mainly in subtle spectral patterns rather than geometry [74,164,175]. Multimodal designs push this idea further. MGF-GCN combines a graph encoder for DSM or LiDAR structure with a Mamba branch for optical imagery, using cross-modality fusion modules to align geometric and radiometric context for urban mapping [176]. MoViM integrates Vision Mamba into paired SAR–optical streams, showing that state-space branches can propagate shared context while leaving modality-specific artefacts to CNN or Transformer sub-networks [177].
Beyond discriminative models, DiffMamba couples CNN–Transformer encoders with diffusion decoders regularised by Mamba-style sequence propagation [178]. In these architectures, Mamba primarily stabilises long-range dependencies inside the generative head and improves the realism of predicted segmentations under heavy clutter or class imbalance, rather than replacing spatial convolutions.
Overall, segmentation results to date support a restrained but positive assessment of Mamba in EO. Global–local hybrids clearly help when tiles are large, classes are highly imbalanced, and infrastructure patterns span long distances, while spectral–channel and multimodal variants extend these gains to multi-band and multi-sensor settings [179,180,181]. At the same time, well-designed CNN or CNN–Transformer segmentors remain strong baselines, and the added complexity of Mamba branches is most defensible when long-range context or cross-modal coupling is demonstrably important.
4.2. Object Detection
Object detection in remote sensing covers oriented ships and vehicles, multi-scale buildings, and very small targets for traffic or aviation monitoring. CNN and Transformer detectors remain strong baselines, but they must trade off ultra-high-resolution inputs, small-object sensitivity, and memory/latency constraints. Recent Mamba-based detectors therefore cast detection as sequence modelling: SSM branches follow scan paths aligned with object geometry to propagate long-range context at near-linear cost.
4.2.1. Oriented and Multimodal Detection
Multimodal detectors for RGB–IR UAV imagery explicitly account for modality-dependent disparities and spatial offsets. Several networks adopt dual branches with mask-guided regularisation and offset-guided fusion so that cross-modal features remain stable under misregistration [182,183,184]. Hybrid CNN–Mamba backbones, as in RemoteDet-Mamba, further encode cross-sensor context and background statistics, improving robustness in cluttered scenes [185]. For hyperspectral data, edge-preserving dimensionality reduction combined with visual Mamba enhances spatial–spectral representations and improves small-object separability [186].
A second line of work inserts Mamba blocks directly into detection pyramids. SSMNet augments the feature pyramid with state-space modules that aggregate information consistently across scales [187]. For small objects in UAV imagery, MV-YOLO introduces hierarchical feature modulation, while YOLOv5_mamba couples bidirectional dense feedback with adaptive gate fusion to refine small-object representations in cluttered scenes [188,189]. Programmable gradients within SSMs have also been exploited in Soar to sharpen small-body detection under scarce or imbalanced data [190].
For oriented detection, OriMamba builds a hybrid Mamba pyramid with a dynamic double head that decouples classification and regression, whereas MambaRetinaNet combines multi-scale convolutions with Mamba blocks to balance global context and local detail [191,192]. Multi-directional scanning strategies further improve infrared object detection by integrating features along several orientations to suppress structured clutter [193]. In SAR ship detection, domain-adaptive state-space modules within a mean-teacher framework support unsupervised cross-domain transfer, complemented by large-strip convolutions and multi-granularity Mamba blocks that capture the elongated context of high-aspect-ratio targets [194,195]. Rotation-invariant backbones such as M-ReDet refine fine-grained features in dense ship clusters and other highly anisotropic scenes [196]. Beyond bounding boxes, context-aware state-space models have been extended to multi-category counting by scanning local neighbourhoods during inference and to single-stream object tracking that maintains localisation in cluttered or forested environments [197,198].
4.2.2. Infrared Small-Target Detection
Infrared small-target detection (ISTD) requires distinguishing faint, often sub-pixel targets from structured backgrounds. Several U-shaped architectures combine CNN encoders with Mamba blocks so that local detail is preserved while long-range context regularises background clutter. EAMNet introduces an adaptive filter module before Mamba encoding to enhance target visibility [199]. HMCNet and SBMambaNet insert spatial-bidirectional Mamba blocks into hybrid CNN–Mamba encoders to improve suppression of structured background clutter [200,201]. SMILE applies a perspective transform to sparsify the background and uses spiral spectral scanning to learn coupled spatial–spectral features [202], whereas MiM-ISTD introduces a “Mamba-in-Mamba” encoder with nested recurrences across spatial scales [203]. Together, these designs treat Mamba mainly as an efficient background modeller that normalises structured clutter and highlights salient responses.
4.2.3. Salient Object Detection
In optical remote sensing images, salient object detection (SOD) aims to localise the most prominent geospatial targets in complex scenes so as to support subsequent analysis and decision-making [204]. Topology-aware hierarchical Mamba networks impose structural constraints that suppress spurious saliency responses [205]. TSFANet aligns multi-scale features in a Transformer–Mamba hybrid to maintain semantic consistency [206], whereas LEMNet uses edge cues in a lightweight Mamba backbone to mitigate the lack of dense pixel annotations under weak supervision [207].
4.3. Change Detection
Change detection estimates land-cover transitions between multi-temporal images while suppressing pseudo-changes caused by illumination, sensor differences, or registration errors. High-resolution urban and peri-urban scenes add further complexity through small building footprints, thin roads, and non-rigid deformations. A recent review indicates that change detection is moving towards foundation models and efficient long-sequence encoders [208].
4.3.1. Spatiotemporal Interaction Backbones
Spatiotemporal interaction backbones place Mamba at the core of bi-temporal feature fusion. ChangeMamba uses shared Siamese encoders followed by a visual Mamba block that scans concatenated pre- and post-event features, allowing each location to integrate cross-time context at linear cost [209]. CD-STMamba extends this idea with a Spatio-Temporal Interaction Module that encodes multi-dimensional correlations during both encoding and decoding [210]. CD-Lamba introduces a Cross-Temporal Locally Adaptive State-Space Scan (CT-LASS) that is designed to enhance the locality perception of the scanning strategy while maintaining global spatio-temporal context in bi-temporal features [211].
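To make the cross-time scanning idea concrete, the sketch below interleaves pre- and post-event tokens per location so that a subsequent recurrence visits both observations of each pixel consecutively. This is an illustrative simplification of the ChangeMamba-style concatenated scan, not the authors' implementation; `serialize_bitemporal` is a hypothetical helper.

```python
import numpy as np

def serialize_bitemporal(pre, post):
    """Interleave pre- and post-event tokens per pixel so a 1-D recurrence
    sees both observations of each location back to back (toy sketch of a
    cross-time scan; not the published ChangeMamba implementation)."""
    H, W, C = pre.shape
    seq = np.empty((H * W * 2, C), dtype=pre.dtype)
    seq[0::2] = pre.reshape(-1, C)   # pre-event token for each location
    seq[1::2] = post.reshape(-1, C)  # its post-event counterpart follows
    return seq
```

Because the two acquisitions of every location are adjacent in the sequence, a linear-time state update can integrate cross-time context without any quadratic pairwise comparison.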
Several methods, such as 2DMCG, KAMamba, ST-Mamba, and SPRMamba, add explicit modules for feature alignment and temporal reasoning. For example, 2DMCG couples a 2D Mamba encoder with change-flow guidance in the decoder to align bi-temporal features and reduce fusion errors caused by spatial misregistration [212]. KAMamba targets long MODIS time series by combining a knowledge-aware transition-matrix loss with sparse deformable Mamba modules to model land-cover dynamics [213]. ST-Mamba introduces a Spatio–Temporal Synergistic Module that maps bi-temporal features into a shared latent space before Mamba propagation, thereby improving background consistency [214]. SPRMamba balances salient and non-salient changes via a saliency-proportion reconciler and squeezed-window scanning [215]. SMNet and LBCDMamba modify the scan pattern and pair Mamba blocks with modules such as RWKV or multi-branch patch attention, with the aim of improving long-range interaction modelling while keeping explicit pathways for local detail [216,217].
4.3.2. Hybrid Convolution–Mamba Architectures
Hybrid architectures retain convolutional blocks for local structure and insert Mamba modules as long-range aggregators. CDMamba interleaves convolutional and Mamba branches via Scaled Residual ConvMamba blocks so that local texture and edges are refined while change cues propagate over larger areas [218]. CWmamba fuses a CNN-based base feature extractor with Mamba to jointly exploit local detail and global context, and Hybrid-MambaCD uses an iterative global–local feature fusion mechanism to merge CNN and Mamba features across scales [219,220]. ConMamba pushes this principle further by building a high-capacity hybrid encoder that deepens the interaction between convolutional and state-space features [221].
Multiscale aggregation is handled explicitly in SPMNet, which adopts a Siamese pyramid Mamba network with hybrid fusion of high- and low-channel semantic features, and in LCCDMamba, whose multiscale information spatio-temporal fusion module aggregates difference information for land-cover change detection [222,223]. MF-VMamba combines a VMamba-based encoder with a multilevel attention decoder to interactively fuse global and local representations, whereas VMMCD targets efficiency with a lightweight design and a feature-guiding fusion module that removes redundancy while preserving accuracy [224,225]. Residual wavelet transforms have been integrated with Mamba to refine fine-grained structural changes and suppress noise [226,227]. Attention–Mamba combinations such as Mamba-MSCCA-Net and AM-CD further enhance hierarchical feature representation, and TTMGNet exploits a tree-topology Mamba to guide hierarchical incremental aggregation [228,229,230]. For unsupervised scenarios, RVMamba couples visual Mamba with posterior-probability-space analysis to detect changes without labelled pairs [231].
To manage multi-scale features more efficiently, a pyramid sequential processing strategy serialises multi-scale tokens into a long sequence and fuses them through Mamba updates [232]. Generative approaches such as IMDCD combine Swin-Mamba encoders with diffusion models, using iterative denoising to refine change maps and reduce artefacts [233]. Collectively, these results suggest that Mamba is most effective when it complements rather than replaces convolution, specialising in coherent long-range aggregation while CNN blocks handle precise localisation.
4.3.3. Alignment-Aware Designs
Geometric misalignment between bi-temporal images is a major source of false alarms. DC-Mamba adopts an “align-then-enhance” strategy: bi-temporal deformable alignment first corrects spatial offsets at the feature level, after which Mamba layers refine change cues [234]. MSA (Mamba Semantic Alignment) instead operates at the semantic level, using a semantic-offset correction block to adjust deeper responses [235]. Building on vision foundation models, SAM-Mamba and SAM2-CD adapt SAM2 encoders to change detection by combining activation-selection gates or Mamba decoders that suppress task-irrelevant variations while sharpening change boundaries [236,237]. These studies underline that state-space dynamics perform best when applied to feature fields that already respect the imaging geometry [234,235,238].
4.3.4. Hyperspectral and Challenging Scenarios
Mamba’s linear complexity is particularly attractive for data-intensive modalities such as hyperspectral imaging and for adverse conditions such as low-light scenes. GDAMamba captures global contextual differences at the image level and enhances temporal spectral discrepancies for hyperspectral change detection [239]. WDP-Mamba introduces a wavelet-augmented dual-branch design with adaptive positional embeddings to better preserve spatial–spectral topology [240]. SFMS couples a tri-plane gated Mamba with SAM-guided priors to stabilise learning for rare classes in hyperspectral change detection [241]. For low-light optical imagery, Mamba-LCD introduces illumination-aware state transitions that amplify weak signals in dark urban regions [242].
4.3.5. Summary
Current Mamba-based change detectors span pure spatiotemporal backbones, hybrid CNN–Mamba networks, and designs that explicitly model cross-time alignment. Across these models, Mamba modules propagate long-range bi-temporal context at roughly linear complexity, while convolutional components refine high-frequency details, precise boundaries, and low-level registration [209,211,219,220,232]. Dedicated alignment blocks—either deformable or foundation-model-based—supply the geometric consistency that state-space dynamics alone do not guarantee [234,235,238].
Overall, Mamba-based CD methods evolve from lightweight SSM backbones toward structured multi-scale Siamese designs, then to explicit spatiotemporal interaction modules, and most recently to foundation-model-assisted pipelines with scalable model-size variants.
To consolidate the above discussion, Table 2 summarises representative WHU-CD results for the Mamba-based detectors reviewed in Section 4.3, together with their reported Params/FLOPs. Figure 5, reproduced from Ref. [210], is included because it makes the qualitative effect of long-range bi-temporal aggregation more transparent, especially in fine structures where boundary localisation and omission errors are most apparent.
Table 2.
Representative Mamba-based change detection methods and performance on the WHU-CD dataset (optical), together with the reported model complexity (Params and FLOPs). Symbol definitions and data sources are provided in Appendix A.2. SAM2-CD reports three model scales (S, B, L) corresponding to small, base, and large variants; SAM-Mamba similarly includes three backbone variants (ViT-B, Hiera-B, Hiera-L). N/R means Not Reported.
Figure 5.
Qualitative comparison of different methods (CNN-, Transformer-, and Mamba-based) on the WHU-CD test set. (I) A complex change scenario involving both building emergence and disappearance. (II,III) Large-scale building change cases. (IV) Small-scale building changes in a complex setting. White denotes true positives (TPs), black true negatives (TNs), red false positives (FPs), and green false negatives (FNs). Compared with other competitors, CD-STMamba better preserves fine-grained boundary details and reduces missed detections in challenging scenes. Reprinted from Ref. [210].
4.4. Scene Classification
Scene classification assigns semantic labels such as residential, industrial, or farmland to image patches and therefore requires models that capture global layout, multi-scale structure, and label co-occurrence. In this setting, Mamba backbones are mainly used as linear-complexity substitutes for attention, often combined with convolutional modules.
For single-label classification, recent work focuses on how 1D state updates can approximate non-causal 2D structure. RSMamba uses a dynamic multi-path scanning mechanism that mixes forward, reverse, and random traversals and attains F1-scores around 95% on UCM and RESISC45 with fewer parameters than ViT-Base or Swin-Tiny [243]. HC-Mamba couples a local content extraction module with cross-activation between convolutional features and Mamba states, while G-VMamba adds a contour enhancement branch to preserve luminance gradients [244,245]. To handle scale variation and limited labels, MPFASS-Net introduces progressive feature aggregation with orthogonal clustering self-supervision, ECP-Mamba applies multiscale contrastive learning to PolSAR data, and HSS-KAMNet hybridises spectral–spatial Kolmogorov–Arnold networks with dual Mamba branches for fine-grained land-cover identification [246,247,248].
For multi-label scene classification, the main challenge is modeling dependencies between co-occurring categories. MLMamba combines a pyramid Mamba encoder with a feature-guided semantic modeling module that refines class-wise embeddings and their relations, achieving competitive mean average precision on UCM-ML and AID-ML with substantially reduced FLOPs and parameter counts compared with Transformer-based baselines [249]. Overall, current Mamba-based scene classifiers fall into two patterns: multi-path or cross-activation scans for single-label scenes, and pyramid Mamba encoders coupled with semantic-relation modeling for multi-label settings.
5. Restoration, Generation, and Domain-Specific Applications
Visual state-space models are currently most mature in image restoration, where long-range dependencies and flexible scanning address the limits of local filters and quadratic-cost attention. Remote-sensing studies then extend these backbones to multimodal generation, compression, security, and scientific EO applications. This chapter therefore focuses on design patterns and where Mamba genuinely shifts the accuracy–efficiency–robustness trade space, rather than enumerating every variant.
5.1. Image Restoration and Geometric Reconstruction
Image restoration is a standard but critical stage in EO processing chains. Super-resolution, dehazing, denoising, and geometric reconstruction directly influence radiometric consistency, change-detection reliability, and the robustness of downstream products [250,251]. These problems combine local operators dictated by sensor physics and geometry with long-range correlations introduced by illumination, atmosphere, and acquisition layout. In this setting, Mamba-based models are appropriate only when long-sequence propagation plays a central role in the degradation process; otherwise, the additional architectural complexity is unlikely to provide clear benefits over well-engineered CNN- or Transformer-based restorers [30,252].
5.1.1. Super-Resolution
Remote-sensing super-resolution (SR) must sharpen man-made structures and edges at large scale without introducing spectral artefacts or aliasing [53,253]. Existing Mamba-based SR methods follow two broad strategies.
The first emphasises lightweight and hybrid architectures. Rep-Mamba, for instance, couples cross-scale state propagation with re-parameterised convolutions so that SSM layers carry context across large receptive fields while convolutions refine local details [254]. Other works keep a conventional CNN backbone and plug Mamba blocks only into deeper layers or skip connections, which reduces the marginal cost of adopting Mamba and simplifies deployment on existing SR pipelines [255,256,257]. These designs generally deliver modest but consistent PSNR/SSIM gains over pure CNN baselines on optical and infrared SR benchmarks, while reducing FLOPs compared with attention-heavy Transformers.
The second line develops frequency- and physics-aware SR models. Some architectures operate in wavelet or Fourier domains, where Mamba propagates information along multi-scale frequency coefficients rather than pixels, improving reconstruction of repetitive textures and fine structures [54,258,259]. Others embed priors on degradation operators and spectral response. Spectral super-resolution models employ Mamba to couple high-resolution multispectral bands with low-resolution hyperspectral measurements, decomposing the mapping into a physically constrained range space and a learned null space that absorbs residual correlations [260]. Here, the benefit of Mamba is clearest when spectral and spatial correlations interact over long ranges; on simple bicubic upsampling baselines with limited aliasing, CNNs often remain strong competitors.
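The range-space/null-space decomposition mentioned above can be stated compactly: any reconstruction can be split as x = A⁺y + (I − A⁺A)z, where the first term is pinned down by the sensor model and the second is an unobserved component a network is free to learn. The snippet below is a generic linear-algebra sketch of this decomposition, with A standing in for an arbitrary degradation operator; it is not the specific spectral super-resolution model of [260].

```python
import numpy as np

def range_null_decompose(A, y, z):
    """Split a reconstruction into a data-consistent range-space term and a
    null-space term that A cannot observe: x = A^+ y + (I - A^+ A) z.
    Any choice of z leaves the measurements A x = y unchanged (illustrative
    sketch; A is a generic linear degradation operator)."""
    A_pinv = np.linalg.pinv(A)
    x_range = A_pinv @ y                             # enforced by the sensor model
    x_null = (np.eye(A.shape[1]) - A_pinv @ A) @ z   # free component a network can learn
    return x_range + x_null
```

In the physics-aware designs surveyed here, the learned component plays the role of the null-space term, which is why measurement consistency is preserved by construction.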
Overall, SR studies suggest that Mamba is most valuable when SR is part of a broader multi-degradation or cross-sensor pipeline and when token sequences reflect physically meaningful propagation paths, not just flattened patches [53,54,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267].
5.1.2. Atmospheric and Weather Restoration
Remote sensing imagery is frequently degraded by haze, clouds, rain streaks, and complex atmospheric scattering, so atmospheric and weather restoration methods seek to recover clear-sky surface reflectance from such observations [268,269]. In these settings, long-range dependencies arise naturally: scattering patterns and cloud fields evolve smoothly over space and time, and restoration must respect radiometric consistency across large regions.
Mamba-based dehazing and deraining networks typically adopt hybrid encoders: convolutional stages extract local gradients and edges, while Mamba branches propagate information along scanlines or multi-scale windows to model extended haze layers and cloud structures [270,271,272]. Some designs move to transform or polarimetric domains, where Mamba captures correlations among frequency components or scattering channels that are difficult to handle with local filters alone [273,274]. Compared with pure CNN baselines, these hybrids often improve structural similarity and colour fidelity under heavy haze, but the gains shrink on mild degradations where simple residual networks already perform well.
Weather-centric models apply similar ideas to rain, snow, and mixed artefacts, sometimes using recurrent or temporal SSM branches when short time series are available [275,276]. Additional illumination-aware and direction-adaptive dehazing and deraining variants follow the same pattern, combining scan-aware Mamba modules with conventional encoders to regularise large-scale shadow and cloud structures [261,262,264,265,266,267]. Here, Mamba’s linear-time recurrence allows the same kernel to process sequences of varying length without changing architecture, which is appealing for operational systems that must handle irregular revisit intervals.
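The length-agnostic property noted above follows directly from the recurrent form of an SSM: the same fixed-size parameters process a sequence of any length in a single linear-time pass. The following minimal diagonal recurrence (a didactic sketch, not any specific published model) makes this explicit.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Minimal diagonal state-space recurrence h_t = a * h_{t-1} + b * x_t,
    y_t = c . h_t. One pass, O(length) time, O(1) state memory, and the same
    parameters apply unchanged to sequences of any length (didactic sketch)."""
    h = np.zeros_like(b)
    ys = []
    for x_t in x:
        h = a * h + b * x_t      # elementwise (diagonal) state update
        ys.append(float(c @ h))  # linear readout of the hidden state
    return np.array(ys)
```

Because the state h has fixed size, irregular revisit intervals simply change the number of update steps, not the architecture or its memory footprint, which is the property operational systems exploit.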
5.1.3. Denoising and Generalised Restoration
Hyperspectral denoising is a stringent test for restoration models: Gaussian and shot noise, striping, dead lines, and mixed artefacts all appear across hundreds of correlated bands [252,253,277]. In this context, Mamba provides a flexible way to couple spectral and spatial dependencies without quadratic attention.
Several works focus on specialised denoisers. Stripe-adapted and omni-selective scanning variants, for example, rearrange tokens along degradation-aware windows or channel groups so that Mamba updates align with noise patterns [278,279]. Cube-selective and continuous-scanning designs reorder voxels so that Mamba updates follow natural spectral–spatial orderings instead of arbitrary raster scans, reducing the effective sequence length while preserving physical neighbourhoods [280,281,282]. LaMamba combines linear attention with a bidirectional state-space layer and spectral attention, using Mamba to propagate information along long spectral paths while attention modules focus on local band interactions [282]. These models tend to outperform pure CNN and Transformer baselines on heavily degraded HSI benchmarks, particularly when noise patterns vary with wavelength or across sensors.
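The voxel-reordering idea behind these cube-selective scans can be illustrated with a toy serialization of an (H, W, B) hyperspectral cube: emitting all bands of a pixel before moving on keeps spectrally adjacent voxels adjacent in the 1-D sequence, whereas a band-major raster scan scatters them. The function below is a simplified sketch, not any of the published scan designs.

```python
import numpy as np

def cube_serialize(cube, order="spectral_first"):
    """Toy voxel reordering for an HSI cube of shape (H, W, B).
    'spectral_first' emits the full spectrum of each pixel contiguously,
    so spectrally adjacent voxels stay adjacent in the 1-D sequence;
    'spatial_first' is the plain band-major raster alternative (sketch only)."""
    H, W, B = cube.shape
    if order == "spectral_first":
        return cube.reshape(H * W, B).reshape(-1)  # pixel-major: spectra contiguous
    return cube.transpose(2, 0, 1).reshape(-1)     # band-major raster scan
```

Which ordering is preferable depends on whether the dominant noise correlations are spectral or spatial, which is exactly the design variable the cube-selective and continuous-scanning methods tune.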
Beyond specialised denoisers, several architectures tackle generalised restoration where multiple degradations (blur, noise, haze, compression artefacts) are handled within a single network [283]. Spatial-frequency hybrids use CNN encoders to capture local textures, Fourier blocks to separate low- and high-frequency components, and Mamba layers to connect these representations across space. The main empirical conclusion is that Mamba helps when degradations introduce long-range correlations—such as global illumination shifts or structured striping—but brings limited extra value for purely local noise where classical CNN priors already suffice [30,252,280,281,282,283].
5.1.4. Geometric Reconstruction: Stereo and Stitching
Geometric reconstruction tasks—stereo disparity estimation, DEM refinement, parallax-based stitching, and related problems—map image content to geometry and camera motion rather than to radiance [53,251]. They rely heavily on epipolar constraints, multi-view geometry, and sensor models, and thus offer a different test bed for Mamba.
Current work mostly uses Mamba as an auxiliary module inside otherwise geometry-aware systems. MEMF-Net and RT-UDRSIS integrate Mamba branches into multi-scale stereo or stitching pipelines, where SSM layers aggregate long-range matching cues while conventional cost volumes and warping handle geometric consistency [41,271,272,284,285,286]. Empirically, these hybrids can reduce artefacts in weakly textured or repetitive regions and improve robustness to radiometric differences between views. However, they do not replace explicit geometric components such as disparity regularisation or bundle adjustment; instead, they act as learnable priors that refine correspondences and fill gaps.
So far, there is little evidence that replacing entire stereo or structure-from-motion (SfM) pipelines with SSM-only architectures would be beneficial. On the contrary, the strongest results come from tight coupling between geometry-aware modules and lightweight Mamba branches, suggesting that the role of SSMs in geometric reconstruction is to complement, not substitute, established model-based methods [41,271,272,287].
5.2. Vision–Language and Generation
Vision–language models for EO must link long visual sequences to text or symbolic outputs under strict memory and latency constraints. Gigapixel scenes and multi-temporal stacks inflate the number of visual tokens far beyond natural-image VLMs, and self-attention in the visual encoder becomes the main bottleneck. Mamba-style visual state-space backbones provide linear-time sequence processing and task-specific scanning, making them attractive as visual encoders for captioning, alignment, and generative pipelines in EO VLMs [288].
5.2.1. Multimodal Alignment and Captioning
A recent taxonomy decomposes multimodal alignment into data-level, feature-level, and output-level schemes, with the visual encoder carrying most of the burden in remote sensing [289]. To cope with long sequences, Mamba-based captioning models modify how feature maps are serialized so that one-dimensional sequences reflect spatial and temporal structure rather than raw raster order.
RSIC-GMamba uses a Mamba backbone whose scan path over feature maps is optimised via genetic operations [290]. By reordering spatial tokens, it brings semantically related regions closer in the sequence and allows recurrence to integrate context along coherent land-cover patterns, improving CIDEr and BLEU-4 over ViT encoders such as MG-Transformer at similar scales. RSCaMa extends to change captioning with temporal and spatial Mamba branches that serialize pre- and post-event images so that the state jointly encodes within-image structure and cross-time differences, improving the fidelity of descriptions of building construction or land-cover conversion [291,292]. DynamicVis builds a high-resolution visual foundation model with selective state-space modules that focus computation on informative regions in 2048 × 2048 scenes [293]. It reduces latency and memory by over 90% relative to ViT baselines while maintaining competitive captioning and retrieval performance in multimodal settings.
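Serialization order matters because a 1-D recurrence can only integrate context along the sequence; reordering tokens changes which regions are neighbours. As a simple fixed baseline from the scan-path family (illustrative only; RSIC-GMamba instead learns its path via genetic operations), a serpentine scan keeps every row transition spatially adjacent, unlike raw raster order.

```python
import numpy as np

def snake_scan(ids_hw):
    """Serpentine serialization of a 2-D token grid: even rows run
    left-to-right, odd rows are reversed, so consecutive sequence positions
    are always spatial neighbours (one simple fixed scan path; sketch only)."""
    rows = [r if i % 2 == 0 else r[::-1] for i, r in enumerate(ids_hw)]
    return np.concatenate(rows)
```

Learned or geometry-aware paths generalise this idea by bringing semantically related rather than merely adjacent tokens close together in the sequence.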
5.2.2. Generative Reconstruction, Compression, and Security
Mamba backbones also appear in generative pipelines where global structure affects rate–distortion trade-offs and robustness. Pan-Mamba tackles pan-sharpening by jointly processing high-resolution panchromatic and lower-resolution multispectral inputs within one backbone [294]. State updates propagate structural cues from the panchromatic channel while preserving multispectral consistency, improving edge sharpness and spectral fidelity in urban scenes. VMIC targets learned compression and replaces CNN hyperpriors with cross-selective scan blocks based on visual SSMs [295]. Bidirectional 2D scans expose long-range spatial and channel correlations to the entropy model, yielding BD-rate reductions of roughly 4–10% over the VTM reference codec without comparable complexity growth. Dimba combines Transformer and Mamba layers for text-to-image diffusion, trading some attention overhead for higher throughput and lower memory usage during sampling [296].
In security, Mamba is often a teacher rather than a target. DMFAA distils features from a Mamba backbone into adversarial perturbations that are then applied to diverse student models, including CNNs and Transformers [297]. On AID, UCM, and NWPU, DMFAA increases attack success rates by 3–13% over state-of-the-art black-box attacks under the same perturbation budget. This suggests that Mamba’s long-range dependencies provide additional degrees of freedom for transferability and offer a stringent stress test for RS recognition systems.
5.3. Domain-Specific Scientific Applications
Beyond generic benchmarks, Mamba-based architectures now appear in scientific EO applications where geometry, dynamics, and multi-sensor constraints dominate classic vision metrics. Across agriculture, disaster response, marine monitoring, and meteorology and infrastructure, a common pattern emerges. CNN or attention branches capture local texture and sensor physics, whereas Mamba modules propagate information along long spatial or temporal trajectories that follow crop phenology, flood evolution, road networks, or ocean currents.
5.3.1. Agriculture and Forestry
In precision agriculture and forestry, crop and canopy dynamics evolve over long temporal sequences, which pushes models to exploit extended time series. SITSMamba combines a CNN spatial encoder with a Mamba temporal encoder to model multi-year crop phenology as long sequences with position-weighted reconstruction [57]. STSMamba further addresses temporal–spectral coupling in MODIS time series through sparse deformable token sequences [298]. MSTFNet fuses improved Mamba modules with dual Swin-Transformers for hyperspectral precision agriculture [299]. At object level, YOLO11-Mamba adds Efficient Mamba Attention to YOLO11 for maize emergence from UAV imagery [300]. CMRNet uses a hybrid CNN–Mamba backbone to enhance semantic features for rapeseed counting and localisation [301], and Succulent-YOLO integrates Mamba-based SR with CLIP-enhanced detection for succulent farmland monitoring [302]. RSVMamba for tree-species classification and EGCM-UNet for parcel boundary delineation follow the same template: CNN branches describe plant morphology and edges, whereas Mamba branches preserve parcel-level continuity and context [303,304].
Monitoring in precision agriculture and forestry is moving toward higher-cadence observation, most notably from UAV video streams. Spaceborne video is emerging, but its use in crop and canopy monitoring remains limited. VideoMamba suggests that selective state-space models can support spatiotemporal modelling [305]. Dedicated studies in remote sensing remain scarce, which leaves clear room for future work.
5.3.2. Disaster Assessment and Emergency Response
Operational disaster assessment and emergency response are constrained by limited latency and resources, which places efficiency and robustness at the centre of model design. EMA-YOLOv9 augments YOLOv9 with Efficient Mamba Attention for real-time fire detection [306]. Flood-DamageSense adopts a multimodal design for flood assessment, fusing SAR/InSAR, optical imagery, and risk layers via a Feature Fusion State-Space module that projects them into a shared representation [307,308]. For geological hazards, SegMamba2D uses a lightweight encoder–decoder with Mamba modules to balance global context with local features for landslide mapping [309]. Mamba-MDRNet integrates pre-trained large language models with Mamba mechanisms to select reliable modalities for natural-disaster scene recognition [310]. LinU-Mamba for wildfire spread and C2Mamba for building change similarly use Mamba blocks to propagate long temporal or spatial links, while CNN and attention components manage local detail and cross-sensor alignment [311,312].
5.3.3. Marine Environment and Water Resources
Marine and inland-water applications are naturally suited to state-space modeling because structures are elongated, boundaries diffuse, and labels sparse. OSDMamba handles SAR-based oil-spill detection with an asymmetric decoder integrating ConvSSM components and deep supervision [313]. Algae-Mamba incorporates Kolmogorov–Arnold networks into a visual state space for algae extraction [314], and a synergistic fusion framework uses Mamba-based coral reef habitat classification to refine satellite-derived bathymetry [315]. OWTDNet detects offshore wind turbines with a dual-branch CNN–Mamba architecture in which a lightweight CNN branch captures turbine signatures and a Mamba branch encodes large-scale ocean context before alignment [316]. STDMamba models sea-surface-temperature time series with temporal convolutions and bidirectional Mamba2 modules, while MMamba uses mutual-information-based feature selection and a Mamba reconstruction module for wind-speed gap filling [317,318].
5.3.4. Meteorology and Infrastructure
Road networks, meteorological fields, and environmental variables share elongated structures, smooth spatial patterns, and strong links to terrain. TrMamba steers its scan along predicted road directions in high-resolution imagery and applies selective Mamba updates along these paths to delineate road networks [319]. FDMamba employs a frequency-driven dual-branch structure to capture fine-grained edge details for road extraction [320], and other hybrids explicitly integrate multi-task learning to refine road-network topology [321]. Mamba-UNet applies selective state-space modeling to precipitation nowcasting via a dual-branch fusion module with multiscale spatiotemporal attention [322]. MCPNet uses an asymmetric Mamba–CNN collaborative architecture to balance memory usage and global modeling for large-scene segmentation [323]. Mambads for terrain-aware downscaling, BS-Mamba for black-soil degradation, and kMetha-Mamba for methane plume segmentation share a common design: CNN components encode local structure and process-related features, and Mamba components propagate information along spatial networks or temporal trajectories [324,325,326].
To help readers navigate the rapidly growing literature, Table 3 provides a compact, task-wise roadmap of Mamba-style EO models. Rather than repeating the full taxonomy, we summarise the dominant design inflection points within each domain—from early drop-in SSM backbones to scan/sequence redesign and hybrid integration—together with representative examples discussed in this survey. This table is intended as a quick reference to support cross-domain comparison and to clarify how architectural choices have evolved across different EO tasks.
Table 3.
Task-wise evolutionary trajectories of Mamba-based EO models, dominant inflection points and representative works.
6. Advanced Frontiers & Future Directions
This section concentrates on questions that remain open before Mamba-style state-space models can be regarded as mature tools for Earth observation. Rather than listing generic “future work”, we focus on five cross-cutting themes: (i) theoretical validity and task regimes, (ii) hardware-aware deployment, (iii) physics-informed designs, (iv) remote-sensing foundation models and scaling behaviour, and (v) green computing, efficiency, and reproducibility.
6.1. Theoretical Substrates and Task Validity
We first revisit the theoretical substrates of visual state-space models and ask in which EO regimes their extra recurrence is justified.
6.1.1. Task Regimes and the Limits of SSMs
The central question is not whether state-space models “work” on benchmarks but for which EO problems their extra recurrence is worth the cost. MambaOut addresses this directly by comparing Mamba-style backbones with gated CNNs and defining a characteristic sequence length that scales with channel dimension [327]. On short sequences far below this length, such as ImageNet classification with 224 × 224 inputs, gated CNNs match or slightly exceed VMamba accuracy while using fewer parameters and FLOPs [327]. These results, together with S4 studies showing that structured state-space layers are most advantageous on sequences of thousands of steps [12], suggest that SSM blocks are often unnecessary for short-range vision tasks.
EO workloads, however, range from small tiles to gigapixel images and multi-year time series. Patch-wise scene classification on crops or detectors with narrow receptive fields rarely reach token counts where SSMs help, and robust CNNs or light CNN–Transformer hybrids are usually easier to optimise and deploy. By contrast, dense prediction on very large scenes, long satellite image time series, or multi-sensor fusion with long correlation lengths resemble the long-range settings where structured SSMs and recent visual variants such as Spatial-Mamba have demonstrated clear benefits [12,328]. For such regimes, linear-time state updates and scan-aware design are more likely to translate into real efficiency and accuracy gains.
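The contrast between these regimes is a matter of simple arithmetic on patch-token counts. The sketch below (assuming a standard square-patch serialization with patch size 16) shows how a typical classification tile stays well below the thousands-of-tokens regime while a single large EO scene exceeds it by two orders of magnitude.

```python
def token_count(side, patch=16):
    """Number of patch tokens for a square image serialized at a given
    patch size (assumes side is divisible by patch)."""
    return (side // patch) ** 2

# Illustrative regimes at patch size 16:
#   224-px tile   ->   196 tokens: short-sequence regime, gated CNNs suffice
#   2048-px scene -> 16384 tokens: long-range regime where linear-time SSMs pay off
```

Multi-year time series and multi-sensor stacks multiply these counts further, which is why dense prediction on large scenes is where scan-aware SSM design is most likely to matter.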
6.1.2. Structured SSMs, Mamba-3, and the Linear-Attention Frontier
Beyond the basic single-input SSM formulation, recent variants introduce multi-dimensional, structured, and hardware-aware parameterisations that fit visual and EO workloads more directly. Sparse Mamba imposes controllability, observability, and stability constraints on the state-transition matrix and promotes sparsity, reducing parameter counts and training time without loss of language-model perplexity [329]. Mamba-3 introduces multi-input–multi-output updates and richer recurrence patterns with explicit accuracy–efficiency trade-offs under fixed inference budgets [15]. Visual State Space Duality (VSSD) adapts causal SSD mechanisms to non-causal vision data by discarding state–token interaction magnitudes, enabling bidirectional context while retaining efficient scan implementations [330]. These developments suggest that operator structure, not just asymptotic complexity, is central for performance and stability.
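All of these variants build on the discretised linear state-space recurrence, which for reference can be written as

```latex
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t ,
```

where Mamba makes $\bar{B}$, $C$, and the discretisation step input-dependent (the selective mechanism), Sparse Mamba constrains the structure of $\bar{A}$, and MIMO variants such as Mamba-3 replace the per-channel scalar recurrence with multi-input–multi-output updates.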
Analyses of linear-attention Transformers suggest that structured SSMs and linear-attention ViTs lie in a shared design space. Both achieve linear memory scaling; they differ mainly in kernel parameterisation, implicit priors, and hardware efficiency. For EO, the relevant question is therefore not “SSM or Transformer?”, but which linear-time operator best matches the sensing geometry, dynamics, and compute budget of a given task. On GPUs with highly optimised FlashAttention kernels, a well-implemented linear attention block may be more efficient than a naive Mamba layer, whereas on memory-limited edge devices the fixed-size state update can be preferable. Systematic comparisons under controlled budgets are still rare, especially for remote-sensing workloads, and represent an immediate research gap.
6.2. Hardware-Aware Deployment
We next consider how Mamba-based models behave under the size–weight–power constraints of EO platforms.
RTMamba demonstrates that visual Mamba backbones can be tailored to edge UAV platforms by replacing self-attention with state-space blocks in a semantic segmentation encoder and discarding redundant high-resolution features before expensive processing, achieving real-time throughput on embedded GPUs/NPUs with accuracy comparable to heavier Transformer baselines [331,332]. Similar state-space encoders have also been applied to event-based vision for space situational awareness, where low-latency processing of asynchronous streams is essential.
At orbital altitudes, EdgePVM adopts a parallel Siamese Vision Mamba architecture for on-board change detection in serverless satellite–edge constellations [333]. On a Jetson AGX + Jetson Orin NX platform, it reports end-to-end latency reductions from 33.338 ms (single GPU) to 28.186 ms (4 GPUs, 1.57×) and 25.194 ms (8 GPUs, 1.73×), while the measured average power increases from 90.4 W to 115.4 W and 149.09 W, respectively, explicitly quantifying the latency–power trade-off under SWaP constraints. These deployment-facing results indicate that multi-stream SSM designs can move change detection closer to the sensor, producing actionable change products directly on board so that only compact change maps, rather than full-resolution imagery, need to be downlinked.
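The figures above also allow a direct check of the energy side of the trade-off: multiplying reported average power by end-to-end latency gives energy per inference, which rises even as latency falls (a simple product of the quoted numbers, ignoring idle draw and batching effects):

```python
# Energy per inference = average power (W) x end-to-end latency (s),
# using the EdgePVM figures quoted above (ignores idle draw and batching).
configs = {
    "1 GPU":  (90.4, 33.338e-3),
    "4 GPUs": (115.4, 28.186e-3),
    "8 GPUs": (149.09, 25.194e-3),
}
energy = {name: power * latency for name, (power, latency) in configs.items()}
for name, joules in energy.items():
    # latency drops with more GPUs, but energy per inference rises
    print(f"{name}: {joules:.2f} J per inference")
```

Under this simple accounting, scaling from one to eight GPUs cuts latency by about 24% but raises energy per inference by roughly a quarter, which is exactly the SWaP tension the deployment study quantifies.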
These examples align with broader trends in 6G and edge intelligence and suggest that claims of scalability for EO architectures should increasingly be supported by evidence of on-board or near-sensor deployment [334].
6.3. Physics-Informed State-Space Models
Most Mamba-based EO models are currently trained as generic sequence learners, with only loose links to the partial differential equations that govern geophysical processes. The state-space formalism, however, sits naturally alongside recent advances in scientific machine learning, where physics-informed neural networks and neural-operator architectures seek to embed differential constraints directly into the learning process [335,336,337]. These developments suggest that Mamba-style SSMs need not remain purely data-driven encoders, but can be used as learnable discretisations of physical dynamics. In what follows, we discuss two complementary directions: viewing Mamba-style SSMs as neural operators for geophysical dynamics, and coupling them with classical numerical solvers through residual or subgrid parameterisations.
6.3.1. SSMs as Neural Operators for Geophysical Dynamics
Work bridging dynamical-systems modelling and machine learning shows that Mamba-like architectures can act as neural operators for nonlinear dynamics. On chaotic benchmarks, they deliver competitive or better long-horizon stability and extrapolation than Transformer-based and classical neural-operator baselines [336,337,338]. In meteorology and hydrology, spatial–temporal Mamba variants such as Mamba-ND, MetMamba, and RiverMamba learn mappings from multivariate reanalysis or forecast fields to future atmospheric states, regional forecasts, or global river discharge, while retaining linear-time complexity in sequence length and demonstrating gains over attention-based and physics-only reference models [339,340,341]. These studies treat the SSM update as a learnable time-stepping scheme whose parameters capture underlying dynamics, rather than as a generic sequence model.
Analogous opportunities exist in EO. Applications such as ocean-colour retrieval, SAR/InSAR time-series analysis, and radiative-transfer inversion could, in principle, benefit from state matrices whose structure reflects diffusion, advection, elasticity, or energy-balance constraints instead of being left fully unconstrained. At present, however, most EO SSMs still learn generic state matrices. Replacing these parameterisations with designs that encode conservation laws or stability criteria, and evaluating their behaviour under distribution shift, is a concrete and overdue research direction.
6.3.2. Coupling SSMs with Classical Solvers
A complementary direction is to couple SSMs with numerical solvers rather than learning full dynamics from scratch. In these hybrid designs, Mamba does not replace PDE solvers; it learns residual tendencies. Examples include subgrid closures for coarse-resolution climate simulations and correction terms for flood or fire propagation, implemented in a time-marching form that matches the state-space update. Hybrid strategies of this kind have already proved effective with other network families, from physics-informed neural networks and deep-learning-based subgrid parameterisations to neural general circulation models and machine-learning corrections of analytical or reduced-order models [335,342,343,344,345]. Recent work on next-generation Earth system models argues that reliable weather and climate prediction will increasingly rely on such AI–physics hybrids rather than purely data-driven surrogates [346].
Sui et al. provided a concrete validation by introducing RTM-grounded supervision [269], using the 6S radiative transfer model to synthesise physically consistent top-of-atmosphere reflectance for training a U-shaped Dual Attention Vision Mamba Network for satellite dehazing. This evidence motivates an operational coupling in which an RTM defines the physics-based update, while a Mamba block learns residual corrections:
$x_{t+1} = \mathcal{F}_{\mathrm{RTM}}(x_t) + \mathcal{G}_{\theta}(x_t)$, where $\mathcal{F}_{\mathrm{RTM}}$ enforces forward consistency and $\mathcal{G}_{\theta}$ parameterises model-mismatch and high-frequency residuals.
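A minimal time-marching sketch of such a coupling is given below; a linear diffusion step stands in for the RTM and a small linear map stands in for the Mamba residual, both placeholders rather than the architecture of [269]:

```python
import numpy as np

rng = np.random.default_rng(0)

def physics_step(x, nu=0.1):
    """Placeholder physics update: explicit 1D diffusion with periodic
    boundaries (a stand-in for an RTM or other forward model)."""
    return x + nu * (np.roll(x, 1) + np.roll(x, -1) - 2 * x)

# Placeholder learned residual: a small random linear map standing in
# for a trained Mamba block that corrects model mismatch.
W = 0.01 * rng.standard_normal((64, 64))

def hybrid_step(x):
    """Physics-based update plus learned residual correction."""
    return physics_step(x) + W @ x

x = rng.standard_normal(64)
for _ in range(10):          # roll the hybrid model forward in time
    x = hybrid_step(x)
print(x.shape)
```

The key design point is that the learned term only carries the residual: the physics step sets the dominant dynamics, so the correction can stay small and the rollout inherits the solver's stability.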
State-space architectures still lack a systematic framework for (i) choosing state dimensions that correspond to physical modes, (ii) regularising parameters to enforce conservation and stability, and (iii) evaluating long-horizon stability and mass/energy conservation against numerical baselines. Progress along these lines will determine whether physics-informed SSMs and Mamba-style operators remain proof-of-concept demonstrations or become components of operational geoscience models.
6.4. Remote-Sensing Foundation Models and State-Space Multimodal Alignment
Remote-sensing foundation models (RSFMs) amortise large-scale pretraining across tasks and sensors, but most current designs remain ViT-based and inherit the quadratic complexity of attention. Mamba-based alternatives aim to retain representational power while relaxing quadratic scaling once long time series, multispectral stacks, or large tiles are used in pretraining, and recent reviews summarise the rapidly growing literature [347,348].
6.4.1. Transformer-Based Remote-Sensing Foundation Models
Early RSFMs adapt masked autoencoding and continual pretraining from natural images to satellite data with ViT encoders. SatMAE extends MAE to temporal and multispectral Sentinel-2 imagery by jointly masking spatial patches, spectral bands, and time steps, and trains large ViT backbones to reconstruct the missing content, yielding strong downstream performance on classification and segmentation [349]. However, attention remains the dominant cost and in practice constrains tile size and temporal depth.
Scale-MAE introduces a scale-aware MAE in which multi-resolution patches are tokenised and jointly masked so that the encoder must infer cross-scale relations, improving scale invariance but leaving attention complexity unchanged [350]. GFM follows a continual-pretraining pipeline: it trains a ViT on ImageNet-22k and then adapts it to geospatial corpora, such as GeoPile, using teacher–student distillation and progressive fine-tuning [351]. This reduces effective pretraining time relative to SatMAE-style training, but the model still inherits quadratic scaling with sequence length. Collectively, these Transformer-based RSFMs demonstrate that MAE-style objectives transfer well to EO but leave scaling to longer sequences and larger scenes largely unresolved.
6.4.2. Mamba-Based RSFMs and High-Resolution Backbones
Mamba-based RSFMs replace attention with state-space layers while keeping objectives similar to SatMAE or GFM. SatMamba replaces the ViT encoder with a visual SSM backbone within the MAE framework. It reports comparable or slightly better performance than ViT-MAE on fMoW and faster reconstruction-loss convergence, but it uses more parameters and greater depth, which complicates like-for-like budget comparisons [26,350]. DynamicVis is designed for 2048 × 2048 optical tiles [293]. It combines selective state-space modules with dynamic region perception to focus computation on informative regions, achieving order-of-magnitude reductions in latency and memory relative to ViT backbones at similar accuracy. RoMA introduces rotation-aware self-supervised pretraining to obtain orientation-robust representations, an important property for nadir and off-nadir views [27]. RingMamba moves toward multi-sensor pretraining by coupling optical and SAR streams through scan-and-scan-couple modules and a mixture of generative and contrastive losses [28]. Geo-Mamba integrates dynamic, static, and categorical geophysical factors into a unified state-space framework, often combined with Kolmogorov–Arnold networks for regression heads, thus linking RSFMs and geophysical forecasting [352]. VMIC uses cross-selective scan Mamba modules as priors in learned image compression and achieves BD-rate reductions of roughly 4–10% over the VTM reference codec on remote-sensing imagery, indicating that SSM-based priors can improve rate–distortion performance in bandwidth-limited settings [295]. Across these models, parameter counts typically range from roughly 80 M to 350 M [26,27,28].
6.4.3. Scaling Laws and Fair Evaluation
Despite this activity, evidence on scaling behaviour in Mamba-based RSFMs remains fragmentary. Most published comparisons differ simultaneously in pretraining datasets, augmentation schemes, and compute budgets, making it difficult to attribute gains to architectural choices. SatMamba, for instance, borrows the ≈800-epoch schedule of SatMAE on fMoW but uses a deeper and larger encoder [26,350], while RingMamba and related models vary backbone width and multi-sensor composition [28]. A critical next step is to design controlled studies in which ViT-MAE and Mamba-MAE encoders are trained under matched token counts, parameter budgets, and optimisation settings and evaluated on the same suite of downstream tasks. Such benchmarks should explicitly report not only task accuracy but also wall-clock time, GPU hours, memory footprint, and performance as a function of context length; without these ingredients, claims about the superiority of one family over the other remain anecdotal.
6.4.4. Multimodal Alignment via State Modulation
Mamba-based RSFMs also invite a different view of multimodal fusion. Whereas ViT architectures typically rely on explicit cross-attention between modalities, many state-space models coordinate modalities through interactions in the hidden state. M3amba couples hyperspectral imagery and LiDAR via a Cross-SS2D module in which the hidden state of one modality modulates the recurrent dynamics of the other, enabling fusion with linear complexity [128]. MFMamba adopts a dual-branch encoder for optical and DSM inputs and introduces an auxiliary Mamba pathway that conditions the optical branch through state modulation [61]. AVS-Mamba extends this design philosophy to audio–visual segmentation with selective temporal and cross-modal scanning [353]. These results indicate that SSM-based cross-modal fusion can be implemented via state modulation rather than token-to-token attention [354]. This approach is well matched to EO settings with heterogeneous resolutions, revisit rates, and noise, and aligns with recent calls for robust cross-modal interaction mechanisms.
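Schematically, state modulation amounts to one modality's hidden state gating the other's recurrent update, so that fusion happens inside the scan rather than through token-to-token attention. The toy below illustrates the idea only; it is not the Cross-SS2D module of [128]:

```python
import numpy as np

def modulated_scan(x_a, x_b, decay=0.9):
    """Toy cross-modal scan: modality B's hidden state gates modality A's
    recurrent update (schematic state modulation, not an actual Cross-SS2D)."""
    h_a = h_b = 0.0
    out = []
    for a, b in zip(x_a, x_b):
        h_b = decay * h_b + b                 # modality B's own recurrence
        gate = 1.0 / (1.0 + np.exp(-h_b))     # B's state sets A's input gate
        h_a = decay * h_a + gate * a          # fusion inside the state update
        out.append(h_a)
    return np.array(out)

fused = modulated_scan(np.ones(8), np.linspace(-2, 2, 8))
print(fused.round(3))
```

Because the interaction happens through the hidden state, the cost stays linear in sequence length, and modalities with different resolutions or revisit rates only need to agree on the scan order, not on a shared token grid.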
6.4.5. Summary
Transformer-based remote-sensing foundation models (RSFMs) underpin most current systems and set strong baselines on public benchmarks, but their quadratic attention limits the context length and image size that can be processed efficiently. Mamba-based RSFMs use similar parameter budgets yet achieve linear complexity in sequence length, so they scale more naturally to long sequences and high resolutions, while state modulation offers a convenient mechanism for multimodal fusion. The main open questions concern how these benefits materialise under matched training budgets and whether state-space designs continue to scale favourably at the billion-parameter level. To make these differences concrete, Table 4 summarises representative Transformer- and Mamba-based RSFMs in terms of backbone choice, supported modalities, and pretraining objectives, providing a quick reference for model design and comparison.
Table 4.
Representative Transformer- and Mamba-based remote-sensing foundation models.
6.5. Green Computing, Efficiency, and Reproducibility
In many Earth observation papers, Mamba-based architectures are compared mainly by accuracy, FLOPs, and nominal memory cost. This can be misleading, because practical efficiency is often dominated by constant factors, memory traffic, and kernel utilization rather than asymptotic complexity. Many visual SSM variants adopt multi-directional scans followed by feature fusion. Each scan direction typically repeats a full state-update pass over the tokens, so the cost tends to grow with the number of directions, and fusion adds further overhead. When sequences are short, these constants cannot be amortized, so linear-time recurrence may not translate into lower wall-clock time. MambaOut reports that common vision settings can fall into a short-token regime where gated convolutional mixing is more efficient [327]. Meanwhile, FlashAttention and FlashAttention-2 improve practical attention efficiency by structuring the computation to maximize parallelism and on-chip data reuse, which reduces memory reads and writes and pushes hardware utilization closer to its ceiling [355,356,357]. These advances do not remove the quadratic nature of attention, but they can substantially narrow the wall-clock gap to current Mamba implementations. Efficiency claims should therefore be scoped by sequence length, scan design, precision, hardware, and implementation, and should report throughput, latency, and peak memory alongside FLOPs.
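A naive FLOP model makes the constant-factor argument explicit: if a k-directional selective scan costs roughly k·L·d·n per layer (with state size n) and full attention costs roughly L²·d, the two coincide at L = k·n. The sketch below, under these simplifying assumptions, shows that raw FLOPs favour the scan already at modest lengths, which is precisely why wall-clock measurements rather than FLOP counts should settle efficiency claims:

```python
# Naive FLOP model (illustrative; ignores kernel-level constants such as
# FlashAttention's reduced memory traffic, which shift the real crossover).
def scan_flops(L, d, n_state=16, k_dirs=4):
    """Approximate cost of a k-directional selective scan over L tokens."""
    return k_dirs * L * d * n_state

def attention_flops(L, d):
    """Approximate cost of full self-attention over L tokens."""
    return L * L * d

# Equating the two models gives a crossover at L = k_dirs * n_state = 64;
# beyond it the scan is nominally cheaper, yet measured wall-clock time
# often disagrees at short lengths because of constants and kernel quality.
for L in (196, 1024, 16384):
    cheaper = "scan" if scan_flops(L, 384) < attention_flops(L, 384) else "attention"
    print(L, cheaper)
```

The gap between this nominal crossover and observed throughput is exactly the measurement gap the surrounding discussion asks authors to report.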
Numerical robustness is a critical implementation challenge. Because selective state-space models are recurrent, training is sensitive to the parameterisation of the state transition and to mixed-precision arithmetic. In long-context scenarios typical of Earth observation, accumulated state updates can amplify magnitudes and destabilise gradients, making optimisation brittle to hyperparameter settings. The restricted dynamic range of low-precision formats such as BF16 or FP16 further raises the risk of overflow and underflow [358]. Consequently, common implementations run the selective-scan kernel in FP32 to avoid numerical divergence, at the cost of higher memory traffic and lower computational efficiency [359]. Variants such as Sparse Mamba add structural constraints to the Mamba-2 architecture to improve stability [329], but remote sensing studies should still report stability-oriented configurations and ablations explicitly rather than relying on nominal complexity guarantees.
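The dynamic-range problem is easy to reproduce with a toy recurrence: iterating even a mildly expansive state update overflows in half precision long before single precision is affected (a deliberately simplified stand-in, not a real selective-scan kernel):

```python
import numpy as np

def iterate(h0, a, steps, dtype):
    """Iterate h <- a * h in the given precision: a toy stand-in for an
    accumulated state update with a slightly expansive transition."""
    h = dtype(h0)
    a = dtype(a)
    for _ in range(steps):
        h = dtype(h * a)
    return h

h16 = iterate(1.0, 1.01, 2000, np.float16)   # exceeds fp16 max (65504) -> inf
h32 = iterate(1.0, 1.01, 2000, np.float32)   # same recurrence stays finite
print(h16, h32)
```

A transition only 1% above unity is enough to blow past the FP16 range within a couple of thousand steps, which is short by EO time-series standards; this is why the scan kernel is commonly kept in FP32 even when the rest of the network trains in mixed precision.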
Recent progress on Mamba efficiency is closely tied to hardware-aware implementations of selective scan, including fused CUDA kernels and scan operators designed to better match GPU execution. In vision, this trend also motivates hardware-aware two-dimensional scan operators [14,15,328,360]. Taken together, efficiency and robustness should be discussed as regime-dependent properties that depend on sequence length, scan design, precision, and kernel-level engineering.
Energy and carbon accounting introduce additional uncertainty. Reported emissions depend on runtime, hardware characteristics, and regional electricity mix, and they can vary substantially across laboratories even for the same model [361,362,363]. A practical compromise is to pair energy-aware reporting with reproducible, regime-specific benchmarks, so that accuracy–efficiency trade-offs can be independently verified under transparent assumptions [364].
Reproducibility also benefits from transparent reporting of the full evaluation protocol, including hardware, precision, batch size, sequence length, scan configuration, and profiling methodology. Whenever possible, releasing code and measurement scripts is the most direct way to support reproducibility of wall-clock time and energy figures. If release is constrained, detailed configurations, environment specifications, and fixed checkpoints can still enable meaningful independent verification.
6.6. Summary
To consolidate the design space reviewed in Section 2, Section 3, Section 4, Section 5 and Section 6, Table 5 provides a compact cross-variant comparison of representative Mamba-style designs in EO. It summarises where each variant is illustrated in this survey, its token-mixing complexity, the typical form of accuracy evidence reported, data requirements, and the EO tasks where it is most commonly applied.
Table 5.
Systematic but compact comparison of representative Mamba-style variants in EO.
7. Conclusions
Mamba-based visual state-space models introduce linear-time, content-dependent recurrence over long visual, spectral, and temporal sequences. In EO, this mechanism is not a drop-in backbone replacement. It acts as a structured operator whose effect depends on how the data are serialised, how the operator is combined with convolutions and attention, and how physical priors and efficiency constraints are built into the architecture.
Three conclusions emerge.
- The natural regime for Mamba in EO is long-context, high-throughput modelling. Across hyperspectral analysis, multi-source fusion, dense perception, and restoration, the most convincing gains appear when models must maintain global context over very large tiles, long image time series, or high-dimensional spectral–spatial sequences under realistic memory budgets. In these regimes, linear-time recurrence allows global dependencies to be modelled without the quadratic overhead of full attention, provided that scanning reflects sensing geometry rather than treating EO data as arbitrary one-dimensional tokens. For patch-wise classification, shallow detectors, and other short-sequence tasks, the evidence points in a different direction. Studies such as MambaOut report that gated CNNs can match or exceed Mamba at modest token lengths while using less compute, so Mamba is not automatically the most economical choice [327]. Treating state-space layers as a universal upgrade is therefore difficult to justify empirically; their use should be argued from sequence length, context requirements, and deployment constraints.
- Current EO practice favours scan-aware hybrid architectures over purely SSM backbones. The strongest systems rarely rely on Mamba alone. CNN–Mamba and Transformer–Mamba hybrids leverage complementary inductive biases. Convolutions and local attention handle edges, textures, registration errors, and sensor noise at high resolutions, whereas Mamba branches propagate information along designed spatial, spectral, temporal, or multimodal trajectories. This division of labour underpins RS-Mamba, RS3Mamba, SITSMamba, CSFMamba, VmambaIR, and many domain-specific networks for HSI classification, HSI–LiDAR fusion, very-high-resolution segmentation, and spatiotemporal reconstruction. In such hybrids, scan design is an explicit modelling decision, not an implementation detail. Centre-focused spirals, cross or omnidirectional scans, graph/superpixel-guided traversals, and transform-domain paths each encode assumptions about where relevant context lies and how it should be propagated. In practice, robust designs treat Mamba as a flexible context engine inserted at stages where tokens already summarise larger receptive fields or long time windows, rather than as a wholesale replacement for convolutions or attention.
- Physics-aware SSMs and Mamba-based foundation models define the next phase, but they must be held to higher standards of stability, efficiency, and reproducibility. Because state-space layers are rooted in dynamical systems, they are natural candidates for physics-informed EO modelling, including long-horizon forecasting, spatiotemporal downscaling, and inverse problems where conservation laws, radiative transfer, or motion models provide strong priors. At the same time, the community is moving toward large, multimodal foundation models in which Mamba-based encoders are combined with contrastive, generative, or instruction-tuned objectives on global EO archives. Works such as SatMAE, Scale-MAE, GFM, SatMamba, RoMA, and RingMamba indicate that Mamba can act as the visual backbone of such systems when attention becomes prohibitively expensive, while still enabling alignment across sensors and tasks. The challenge now is not simply to scale these models but to do so with explicit analyses of numerical stability, calibration, and energy use, and with open checkpoints and code so that accuracy–efficiency trade-offs can be independently verified.
On this basis, it would be premature to claim that Mamba will become the dominant backbone for EO. The available results support a more nuanced position. Mamba-style SSMs are already indispensable in certain high-context regimes, but they coexist with—and often rely on—carefully engineered convolutional and attention modules. More broadly, state-space thinking is reshaping how EO problems are formulated. Sequences are defined along physically meaningful scan paths, long-range couplings become design primitives, and hybrid backbones are judged not only by benchmark accuracy but also by geometric fidelity and deployment budgets.
Looking forward, progress will depend as much on disciplined experimental practice as on new architectures. Systematic benchmarks that vary sequence length, scan strategy, and hardware platform are needed to delineate when SSMs are genuinely advantageous. Physics-informed SSMs should be evaluated against strong baselines in data assimilation, downscaling, and hazard monitoring, not only on generic vision datasets. Foundation models that build on Mamba must report scaling behaviour, robustness across regions and sensors, and carbon and energy footprints alongside task performance. If these conditions are satisfied, Mamba-style architectures can become core components in EO pipelines. Their appeal is not fashion, but structured recurrence that—used judiciously—addresses concrete limitations of CNNs and Transformers under high spatial resolution, high spectral dimensionality, and dense temporal sampling.
Author Contributions
Z.L.: Conceptualization, Investigation (literature search and analysis), Writing—original draft, Visualization. L.Z.: Validation, Writing—review & editing. Y.L.: Validation, Visualization, Writing—review & editing. Y.M.: Validation, Writing—review & editing. G.L.: Conceptualization, Supervision, Project administration, Funding acquisition, Writing—review & editing. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Earth Observation Data Center (China), grant number 2024YFB3908404-03. The APC was funded by the same grant.
Data Availability Statement
No new data were created or analyzed in this study. Data sharing is not applicable to this article.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| BEV | Bird’s Eye View |
| CNN | Convolutional Neural Network |
| DEM | Digital Elevation Model |
| DSM | Digital Surface Model |
| EO | Earth Observation |
| FLOPs | Floating Point Operations |
| GCN | Graph Convolutional Network |
| GPU | Graphics Processing Unit |
| HSI | Hyperspectral Imaging/Image |
| InSAR | Interferometric Synthetic Aperture Radar |
| ISTD | Infrared Small-Target Detection |
| KAN | Kolmogorov–Arnold Network |
| LiDAR | Light Detection and Ranging |
| LLM | Large Language Model |
| LTI | Linear Time-Invariant |
| MAE | Masked Autoencoder |
| mIoU | mean Intersection over Union |
| MSI | Multispectral Imaging/Image |
| NPU | Neural Processing Unit |
| PDE | Partial Differential Equation |
| PSNR | Peak Signal-to-Noise Ratio |
| RGB | Red, Green, Blue |
| RMSE | Root Mean Square Error |
| RS | Remote Sensing |
| RSFM | Remote-Sensing Foundation Model |
| RTM | Radiative Transfer Model |
| SAM | Segment Anything Model |
| SAR | Synthetic Aperture Radar |
| SFM | Structured Feature Matching |
| SITS | Satellite Image Time Series |
| SOD | Salient Object Detection |
| SR | Super-Resolution |
| SSIM | Structural Similarity Index Measure |
| SSM | State-Space Model |
| UAV | Unmanned Aerial Vehicle |
| ViT | Vision Transformer |
| VLM | Vision-Language Model |
| VSSD | Visual State Space Duality |
| YOLO | You Only Look Once |
Appendix A
Appendix A.1. Geometry-Based Indices for Quantitative Comparison of Scan Mechanisms
In this survey, we use simple traversal-based indices as reproducible proxies to quantify how a scan path preserves local spatial continuity. These indices are not intended as community-standard evaluation metrics, but as a lightweight way to support quantitative discussion of scan designs.
For an $N \times N$ tile, a scan serialises the 2D grid into a sequence of length $L = N^2$. Let the 2D coordinate of the token at sequence index $i$ be $p(i) = (x_i, y_i)$.
We define the Euclidean step distance between consecutive tokens as $d_i = \lVert p(i+1) - p(i) \rVert_2$ for $i = 1, \dots, L-1$.
We report the following three quantities:
- 1. Mean step distance: $\bar{d} = \frac{1}{L-1} \sum_{i=1}^{L-1} d_i$. Meaning: an average measure of local continuity. Values of $\bar{d}$ closer to 1 indicate that the scan predominantly moves between spatially adjacent pixels.
- 2. Maximum jump: $d_{\max} = \max_{1 \le i \le L-1} d_i$. Meaning: the worst-case discontinuity along the scan. A larger $d_{\max}$ indicates that the scan contains at least one large spatial jump, such as row/column boundaries in raster-like traversals.
- 3. Jump ratio: $\rho = \frac{1}{L-1} \lvert \{ i : d_i > 1 \} \rvert$. Meaning: how often non-local moves occur. $\rho = 0$ means every step is a unit move on the grid, whereas larger $\rho$ means the scan breaks local adjacency more frequently.
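For concreteness, the three indices can be computed as follows for a plain raster scan and a serpentine (boustrophedon) scan on a small grid, directly following the definitions above:

```python
import math

def scan_indices(path):
    """Mean step distance, maximum jump, and jump ratio for a scan path
    given as a list of (x, y) grid coordinates in visit order."""
    steps = [math.dist(p, q) for p, q in zip(path, path[1:])]
    mean_step = sum(steps) / len(steps)
    max_jump = max(steps)
    jump_ratio = sum(s > 1 for s in steps) / len(steps)
    return mean_step, max_jump, jump_ratio

def raster(n):
    """Row-by-row traversal: large jumps at every row boundary."""
    return [(x, y) for y in range(n) for x in range(n)]

def serpentine(n):
    """Boustrophedon traversal: alternating row direction, all unit moves."""
    return [(x if y % 2 == 0 else n - 1 - x, y)
            for y in range(n) for x in range(n)]

print(scan_indices(raster(4)))      # mean > 1, max jump sqrt(10), ratio 0.2
print(scan_indices(serpentine(4)))  # (1.0, 1.0, 0.0): fully local scan
```

On a 4 × 4 grid the raster scan incurs a jump of $\sqrt{10}$ at each of its three row boundaries, while the serpentine scan achieves the ideal values $\bar{d} = 1$, $d_{\max} = 1$, $\rho = 0$, illustrating how the indices separate locality-preserving from locality-breaking traversals.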
Appendix A.2. Notes on Table 2, Symbol Definitions and Data Sources
- N/R (Not Reported): The paper did not provide Params/FLOPs under any explicit configuration.
References
- Lecun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
- Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
- Li, Y.; Zhang, H.; Xue, X.; Jiang, Y.; Shen, Q. Deep learning for remote sensing image classification: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1264. [Google Scholar] [CrossRef]
- Zhang, L.; Zhang, L.; Du, B. Deep learning for remote sensing data: A technical tutorial on the state of the art. IEEE Geosci. Remote Sens. Mag. 2016, 4, 22–40. [Google Scholar] [CrossRef]
- Chen, Y.; Lin, Z.; Zhao, X.; Wang, G.; Gu, Y. Deep learning-based classification of hyperspectral data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2094–2107. [Google Scholar] [CrossRef]
- Kampffmeyer, M.; Salberg, A.B.; Jenssen, R. Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2016, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 680–688. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Paheding, S.; Saleem, A.; Siddiqui, M.F.H.; Rawashdeh, N.; Essa, A.; Reyes, A.A. Advancing horizons in remote sensing: A comprehensive survey of deep learning models and applications in image classification and beyond. Neural Comput. Appl. 2024, 36, 16727–16767. [Google Scholar] [CrossRef]
- Aleissaee, A.A.; Kumar, A.; Anwer, R.M.; Khan, S.; Cholakkal, H.; Xia, G.S.; Khan, F.S. Transformers in remote sensing: A survey. Remote Sens. 2023, 15, 1860. [Google Scholar] [CrossRef]
- Fichtl, A.M.; Bohn, J.; Kelber, J.; Mosca, E.; Groh, G. The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures. arXiv 2025, arXiv:2510.05364. [Google Scholar] [CrossRef]
- Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2022, arXiv:2111.00396. [Google Scholar] [CrossRef]
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 11 April–10 May 2024. [Google Scholar]
- Dao, T.; Gu, A. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. arXiv 2024, arXiv:2405.21060. [Google Scholar] [CrossRef]
- Anonymous. Mamba-3: Improved Sequence Modeling Using State-Space Systems. OpenReview. Available online: https://openreview.net/forum?id=HwCvaJOiCj (accessed on 6 January 2026).
- Bao, M.; Lyu, S.; Xu, Z.; Zhou, H.; Ren, J.; Xiang, S.; Li, X.; Cheng, G. Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook. arXiv 2025, arXiv:2505.00630. [Google Scholar] [CrossRef]
- Xu, R.; Yang, S.; Wang, Y.; Cai, Y.; Du, B.; Chen, H. Visual mamba: A survey and new outlooks. arXiv 2024, arXiv:2404.18861. [Google Scholar]
- Rahman, M.M.; Tutul, A.A.; Nath, A.; Laishram, L.; Jung, S.K.; Hammond, T. Mamba in vision: A comprehensive survey of techniques and applications. arXiv 2024, arXiv:2410.03105. [Google Scholar] [CrossRef]
- Liu, X.; Zhang, C.; Huang, F.; Xia, S.; Wang, G.; Zhang, L. Vision mamba: A comprehensive survey and taxonomy. IEEE Trans. Neural Netw. Learn. Syst. 2025. early access. [Google Scholar] [CrossRef]
- Zhang, H.; Zhu, Y.; Wang, D.; Zhang, L.; Chen, T.; Wang, Z.; Ye, Z. A survey on visual mamba. Appl. Sci. 2024, 14, 5683. [Google Scholar] [CrossRef]
- Patro, B.N.; Agneeswaran, V.S. Mamba-360: Survey of state space models as transformer alternative for long sequence modelling: Methods, applications, and challenges. Eng. Appl. Artif. Intell. 2025, 159, 111279. [Google Scholar] [CrossRef]
- Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
- Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
- Hatamizadeh, A.; Kautz, J. Mambavision: A hybrid mamba-transformer vision backbone. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 25261–25270. [Google Scholar]
- Mao, J.; Ma, H.; Liang, Y. BiMambaHSI: Bidirectional Spectral–Spatial State Space Model for Hyperspectral Image Classification. Remote Sens. 2025, 17, 3676. [Google Scholar] [CrossRef]
- Duc, C.M.; Fukui, H. SatMamba: Development of Foundation Models for Remote Sensing Imagery Using State Space Models. arXiv 2025, arXiv:2502.00435. [Google Scholar] [CrossRef]
- Wang, F.; Wang, Y.; Chen, M.; Zhao, H.; Sun, Y.; Wang, S.; Wang, H.; Wang, D.; Lan, L.; Yang, W.; et al. Roma: Scaling up mamba-based foundation models for remote sensing. arXiv 2025, arXiv:2503.10392. [Google Scholar]
- Wang, P.; Chang, H.; Hu, H.; Li, X.; Liu, X.; Liu, Y.; Zhang, Z.; Chen, C.; Li, Y.; Feng, Y.; et al. RingMamba: Remote Sensing Multi-sensor Pre-training with Visual State Space Model. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5640316. [Google Scholar]
- Yang, Y.; Qu, J.; Huang, L.; Dong, W. DPMamba: Distillation prompt mamba for multimodal remote sensing image classification with missing modalities. In Proceedings of the 34th International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 16–22 August 2025; pp. 2224–2232. [Google Scholar]
- Teng, Y.; Wu, Y.; Shi, H.; Ning, X.; Dai, G.; Wang, Y.; Li, Z.; Liu, X. Dim: Diffusion mamba for efficient high-resolution image synthesis. arXiv 2024, arXiv:2405.14224. [Google Scholar]
- Gu, A.; Johnson, I.; Goel, K.; Saab, K.; Dao, T.; Rudra, A.; Ré, C. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Adv. Neural Inf. Process. Syst. 2021, 34, 572–585. [Google Scholar]
- Gu, A.; Goel, K.; Gupta, A.; Ré, C. On the parameterization and initialization of diagonal state space models. Adv. Neural Inf. Process. Syst. 2022, 35, 35971–35983. [Google Scholar]
- Gupta, A.; Gu, A.; Berant, J. Diagonal state spaces are as effective as structured state spaces. Adv. Neural Inf. Process. Syst. 2022, 35, 22982–22994. [Google Scholar]
- Smith, J.T.; Warrington, A.; Linderman, S.W. Simplified State Space Layers for Sequence Modeling. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Hasani, R.; Lechner, M.; Wang, T.H.; Chahine, M.; Amini, A.; Rus, D. Liquid structural state-space models. arXiv 2022, arXiv:2209.12951. [Google Scholar] [CrossRef]
- Ma, X.; Zhou, C.; Kong, X.; He, J.; Gui, L.; Neubig, G.; May, J.; Zettlemoyer, L. Mega: Moving average equipped gated attention. arXiv 2022, arXiv:2209.10655. [Google Scholar]
- Li, Y.; Cai, T.; Zhang, Y.; Chen, D.; Dey, D. What makes convolutional models great on long sequence modeling? arXiv 2022, arXiv:2210.09298. [Google Scholar] [CrossRef]
- Orvieto, A.; Smith, S.L.; Gu, A.; Fernando, A.; Gulcehre, C.; Pascanu, R.; De, S. Resurrecting recurrent neural networks for long sequences. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 26670–26698. [Google Scholar]
- Poli, M.; Massaroli, S.; Nguyen, E.; Fu, D.Y.; Dao, T.; Baccus, S.; Bengio, Y.; Ermon, S.; Ré, C. Hyena hierarchy: Towards larger convolutional language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 28043–28078. [Google Scholar]
- Yang, C.; Chen, Z.; Espinosa, M.; Ericsson, L.; Wang, Z.; Liu, J.; Crowley, E.J. Plainmamba: Improving non-hierarchical mamba in visual recognition. arXiv 2024, arXiv:2403.17695. [Google Scholar]
- Shi, Y.; Xia, B.; Jin, X.; Wang, X.; Zhao, T.; Xia, X.; Xiao, X.; Yang, W. Vmambair: Visual state space model for image restoration. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 5560–5574. [Google Scholar] [CrossRef]
- Wang, F.; Wang, J.; Ren, S.; Wei, G.; Mei, J.; Shao, W.; Zhou, Y.; Yuille, A.; Xie, C. Mamba-Reg: Vision Mamba Also Needs Registers. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 14944–14953. [Google Scholar]
- Behrouz, A.; Santacatterina, M.; Zabih, R. Mambamixer: Efficient selective state space models with dual token and channel selection. arXiv 2024, arXiv:2403.19888. [Google Scholar] [CrossRef]
- Patro, B.N.; Agneeswaran, V.S. Simba: Simplified mamba-based architecture for vision and multivariate time series. arXiv 2024, arXiv:2403.15360. [Google Scholar]
- Hu, V.T.; Baumann, S.A.; Gui, M.; Grebenkova, O.; Ma, P.; Fischer, J.; Ommer, B. Zigma: A dit-style zigzag mamba diffusion model. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 148–166. [Google Scholar]
- Tang, X.; Yao, Y.; Ma, J.; Zhang, X.; Yang, Y.; Wang, B.; Jiao, L. SpiralMamba: Spatial-Spectral Complementary Mamba with Spatial Spiral Scan for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5510319. [Google Scholar] [CrossRef]
- Zhao, S.; Chen, H.; Zhang, X.; Xiao, P.; Bai, L.; Ouyang, W. Rs-mamba for large remote sensing image dense prediction. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5633314. [Google Scholar] [CrossRef]
- Xie, F.; Zhang, W.; Wang, Z.; Ma, C. Quadmamba: Learning quadtree-based selective scan for visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 117682–117707. [Google Scholar]
- Li, B.; Xiao, H.; Tang, L. Scaling Vision Mamba Across Resolutions via Fractal Traversal. arXiv 2025, arXiv:2505.14062. [Google Scholar] [CrossRef]
- Li, T.; Li, C.; Lyu, J.; Pei, H.; Zhang, B.; Jin, T.; Ji, R. DAMamba: Vision State Space Model with Dynamic Adaptive Scan. arXiv 2025, arXiv:2502.12627. [Google Scholar] [CrossRef]
- Zhao, M.; Zhang, C.; Yue, P.; Cai, C.; Ye, F. MDA-RSM: Multi-directional adaptive remote sensing mamba for building extraction. GISci. Remote Sens. 2025, 62, 2568776. [Google Scholar] [CrossRef]
- Wang, T.; Bai, T.; Xu, C.; Liu, B.; Zhang, E.; Huang, J.; Zhang, H. AtrousMamba: An Atrous-Window Scanning Visual State Space Model for Remote Sensing Change Detection. arXiv 2025, arXiv:2507.16172. [Google Scholar]
- Xiao, Y.; Yuan, Q.; Jiang, K.; Chen, Y.; Zhang, Q.; Lin, C.W. Frequency-assisted mamba for remote sensing image super-resolution. IEEE Trans. Multimed. 2024, 27, 1783–1796. [Google Scholar] [CrossRef]
- Zhang, Z.; Hu, Z.; Cao, B.; Li, P.; Su, Q.; Dong, Z.; Wang, T. Wiener filter-based Mamba for Remote Sensing Image Super-Resolution with Novel Degradation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 26295–26308. [Google Scholar] [CrossRef]
- Rong, Z.; Zhao, Z.; Wang, Z.; Ma, L. FaRMamba: Frequency-based learning and Reconstruction aided Mamba for Medical Segmentation. arXiv 2025, arXiv:2507.20056. [Google Scholar]
- Lu, D.; Gao, K.; Li, J.; Zhang, D.; Xu, L. Exploring Token Serialization for Mamba-Based LiDAR Point Cloud Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5705514. [Google Scholar] [CrossRef]
- Qin, X.; Su, X.; Zhang, L. SITSMamba for crop classification based on satellite image time series. arXiv 2024, arXiv:2409.09673. [Google Scholar] [CrossRef]
- Zhu, Q.; Fang, Y.; Cai, Y.; Chen, C.; Fan, L. Rethinking scanning strategies with vision mamba in semantic segmentation of remote sensing imagery: An experimental study. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 18223–18234. [Google Scholar] [CrossRef]
- Wang, Z.; Zheng, J.Q.; Zhang, Y.; Cui, G.; Li, L. Mamba-unet: Unet-like pure visual mamba for medical image segmentation. arXiv 2024, arXiv:2402.05079. [Google Scholar]
- Ma, X.; Zhang, X.; Pun, M.O. Rs3mamba: Visual state space model for remote sensing image semantic segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6011405. [Google Scholar] [CrossRef]
- Wang, Y.; Cao, L.; Deng, H. MFMamba: A mamba-based multi-modal fusion network for semantic segmentation of remote sensing images. Sensors 2024, 24, 7266. [Google Scholar] [CrossRef] [PubMed]
- Liu, M.; Dan, J.; Lu, Z.; Yu, Y.; Li, Y.; Li, X. CM-UNet: Hybrid CNN-Mamba UNet for remote sensing image semantic segmentation. arXiv 2024, arXiv:2405.10530. [Google Scholar]
- Xiao, P.; Dong, Y.; Zhao, J.; Peng, T.; Geiß, C.; Zhong, Y.; Taubenböck, H. MF-Mamba: Multi-Scale Convolution and Mamba Fusion Model for Semantic Segmentation of Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5405916. [Google Scholar] [CrossRef]
- He, Y.; Tu, B.; Liu, B.; Li, J.; Plaza, A. HSI-MFormer: Integrating Mamba and Transformer Experts for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5621916. [Google Scholar] [CrossRef]
- Chen, X.; Hu, W.; Dong, X.; Lin, S.; Chen, Z.; Cao, M.; Zhuang, Y.; Han, J.; Xu, H.; Liang, X. Transmamba: Fast universal architecture adaption from transformers to mamba. arXiv 2025, arXiv:2502.15130. [Google Scholar] [CrossRef]
- Li, Y.; Xie, R.; Yang, Z.; Sun, X.; Li, S.; Han, W.; Kang, Z.; Cheng, Y.; Xu, C.; Wang, D.; et al. Transmamba: Flexibly switching between transformer and mamba. arXiv 2025, arXiv:2503.24067. [Google Scholar] [CrossRef]
- Li, J.; Liu, Z.; Liu, S.; Wang, H. MBSSNet: A Mamba-Based Joint Semantic Segmentation Network for Optical and SAR Images. IEEE Geosci. Remote Sens. Lett. 2025, 22, 6004305. [Google Scholar] [CrossRef]
- Zhang, Q.; Zhang, X.; Quan, C.; Zhao, T.; Huo, W.; Huang, Y. Mamba-STFM: A Mamba-Based Spatiotemporal Fusion Method for Remote Sensing Images. Remote Sens. 2025, 17, 2135. [Google Scholar] [CrossRef]
- Li, Z.; Wu, J.; Zhang, Y.; Yan, Y. MHCMamba: Multiscale Hybrid Convolution Mamba Network for Hyperspectral and LiDAR Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 23156–23170. [Google Scholar] [CrossRef]
- Wang, W.; Yu, P.; Li, M.; Zhong, X.; He, Y.; Su, H.; Zhou, Y. Tdfnet: Twice decoding v-mamba-cnn fusion features for building extraction. Geo-Spat. Inf. Sci. 2025, 1–20. [Google Scholar] [CrossRef]
- Zhao, Z.; He, P. Yolo-mamba: Object detection method for infrared aerial images. Signal Image Video Process. 2024, 18, 8793–8803. [Google Scholar] [CrossRef]
- Huang, L.; Tan, J.; Chen, Z. Mamba-UAV-SegNet: A Multi-Scale Adaptive Feature Fusion Network for Real-Time Semantic Segmentation of UAV Aerial Imagery. Drones 2024, 8, 671. [Google Scholar] [CrossRef]
- Li, Y.; Li, D.; Xie, W.; Ma, J.; He, S.; Fang, L. Semi-mamba: Mamba-driven semi-supervised multimodal remote sensing feature classification. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 9837–9849. [Google Scholar] [CrossRef]
- Shen, Y.; Xiao, L.; Chen, J.; Du, Q.; Ye, Q. Learning Cross-task Features with Mamba for Remote Sensing Image Multi-task Prediction. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5612116. [Google Scholar] [CrossRef]
- Zhang, G.; Zhang, Z.; Deng, J.; Bian, L.; Yang, C. S2CrossMamba: Spatial–Spectral Cross-Mamba for Multimodal Remote Sensing Image Classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5510705. [Google Scholar] [CrossRef]
- Luo, L.; Zhang, Y.; Xu, Y.; Yue, T.; Wang, Y. A VMamba-based Spatial-Spectral Fusion Network for Remote Sensing Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 14115–14131. [Google Scholar] [CrossRef]
- He, Y.; Tu, B.; Jiang, P.; Liu, B.; Li, J.; Plaza, A. Classification of Multisource Remote Sensing Data Using Slice Mamba. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5505414. [Google Scholar]
- Liu, C.; Wang, F.; Jia, Q.; Liu, L.; Zhang, T. AMamNet: Attention-Enhanced Mamba Network for Hyperspectral Remote Sensing Image Classification. Atmosphere 2025, 16, 541. [Google Scholar] [CrossRef]
- Yang, X.; Yang, J.; Li, L.; Xue, S.; Shi, H.; Tang, H.; Huang, X. HG-Mamba: A Hybrid Geometry-Aware Bidirectional Mamba Network for Hyperspectral Image Classification. Remote Sens. 2025, 17, 2234. [Google Scholar] [CrossRef]
- Yang, X.; Li, L.; Xue, S.; Li, S.; Yang, W.; Tang, H.; Huang, X. MRFP-Mamba: Multi-Receptive Field Parallel Mamba for Hyperspectral Image Classification. Remote Sens. 2025, 17, 2208. [Google Scholar] [CrossRef]
- He, Y.; Tu, B.; Liu, B.; Li, J.; Plaza, A. 3DSS-Mamba: 3D-spectral-spatial mamba for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5534216. [Google Scholar] [CrossRef]
- Li, G.; Ye, M. MVNet: Hyperspectral Remote Sensing Image Classification Based on Hybrid Mamba-Transformer Vision Backbone Architecture. arXiv 2025, arXiv:2507.04409. [Google Scholar] [CrossRef]
- Sheng, J.; Zhou, J.; Wang, J.; Ye, P.; Fan, J. Dualmamba: A lightweight spectral-spatial mamba-convolution network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5501415. [Google Scholar] [CrossRef]
- Yao, J.; Hong, D.; Li, C.; Chanussot, J. Spectralmamba: Efficient mamba for hyperspectral image classification. arXiv 2024, arXiv:2404.08489. [Google Scholar] [CrossRef]
- Zhang, T.; Xuan, C.; Cheng, F.; Tang, Z.; Gao, X.; Song, Y. CenterMamba: Enhancing Semantic Representation with Center-Scan Mamba Network for Hyperspectral Image Classification. Expert Syst. Appl. 2025, 287, 127985. [Google Scholar] [CrossRef]
- Bai, Y.; Wu, H.; Zhang, L.; Guo, H. Lightweight Mamba Model Based on Spiral Scanning Mechanism for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2025, 22, 5502305. [Google Scholar] [CrossRef]
- Wang, G.; Zhang, X.; Peng, Z.; Zhang, T.; Jiao, L. S2mamba: A spatial-spectral state space model for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5511413. [Google Scholar]
- Zhang, H.; Liu, H.; Shi, Z.; Mao, S.; Chen, N. ConvMamba: Combining Mamba with CNN for hyperspectral image classification. Neurocomputing 2025, 652, 131016. [Google Scholar] [CrossRef]
- Huang, L.; Chen, Y.; He, X. Spectral-spatial mamba for hyperspectral image classification. arXiv 2024, arXiv:2404.18401. [Google Scholar] [CrossRef]
- Ahmad, M.; Butt, M.H.F.; Usama, M.; Altuwaijri, H.A.; Mazzara, M.; Distefano, S.; Khan, A.M. Multi-head spatial-spectral mamba for hyperspectral image classification. Remote Sens. Lett. 2025, 16, 339–353. [Google Scholar] [CrossRef]
- He, Y.; Tu, B.; Jiang, P.; Liu, B.; Li, J.; Plaza, A. IGroupSS-Mamba: Interval group spatial-spectral mamba for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5538817. [Google Scholar] [CrossRef]
- Lu, S.; Zhang, M.; Huo, Y.; Wang, C.; Wang, J.; Gao, C. SSUM: Spatial–spectral unified Mamba for hyperspectral image classification. Remote Sens. 2024, 16, 4653. [Google Scholar] [CrossRef]
- Duan, Y.; Yu, L.; Chen, J.; Zeng, Z.; Li, J.; Plaza, A. A New Multiscale Superpixel Mamba for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5527016. [Google Scholar] [CrossRef]
- Song, Q.; Tu, B.; He, Y.; Liu, B.; Li, J.; Plaza, A. Superpixel-Integrated Dual-Stage Mamba for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5526617. [Google Scholar] [CrossRef]
- Yang, A.; Li, M.; Ding, Y.; Fang, L.; Cai, Y.; He, Y. Graphmamba: An efficient graph structure learning vision mamba for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5537414. [Google Scholar] [CrossRef]
- Wang, Y.; Liu, L.; Xiao, J.; Yu, D.; Tao, Y.; Zhang, W. MambaHSI+: Multidirectional State Propagation for Efficient Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4411414. [Google Scholar] [CrossRef]
- Ming, R.; Chen, N.; Peng, J.; Sun, W.; Ye, Z. Semantic Tokenization-Based Mamba for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 4227–4241. [Google Scholar] [CrossRef]
- Zhao, F.; Zhang, Z.; Huang, L.; Hai, Y.; Fu, Z.; Tang, B.H. MHS-Mamba: A Multi-Hierarchical Semantic Model for UAV Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 24617–24631. [Google Scholar] [CrossRef]
- Du, A.; Zhao, G.; Cao, M.; Wang, Y.; Dong, A.; Lv, G.; Gao, Y.; Li, D.; Dong, X. Cross-domain hyperspectral image classification via mamba-CNN and knowledge distillation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5524415. [Google Scholar] [CrossRef]
- Huang, X.; Zhang, Y.; Luo, F.; Dong, Y. Dynamic token augmentation mamba for cross-scene classification of hyperspectral image. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5539713. [Google Scholar] [CrossRef]
- Xu, Y.; Wang, D.; Jiao, H.; Zhang, L.; Zhang, L. MambaMoE: Mixture-of-Spectral-Spatial-Experts State Space Model for Hyperspectral Image Classification. arXiv 2025, arXiv:2504.20509. [Google Scholar] [CrossRef]
- Ahmad, M.; Butt, M.H.F.; Usama, M.; Mazzara, M.; Distefano, S.; Khan, A.M.; Hong, D. Hybrid State-Space and GRU-based Graph Tokenization Mamba for Hyperspectral Image Classification. arXiv 2025, arXiv:2502.06427. [Google Scholar]
- Wang, H.; Zhuang, P.; Zhang, X.; Li, J. DBMGNet: A Dual-Branch Mamba-GCN Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4410517. [Google Scholar] [CrossRef]
- Liao, J.; Wang, L. HyperspectralMamba: A Novel State Space Model Architecture for Hyperspectral Image Classification. Remote Sens. 2025, 17, 2577. [Google Scholar] [CrossRef]
- Sun, M.; Zhang, J.; He, X.; Zhong, Y. Bidirectional mamba with dual-branch feature extraction for hyperspectral image classification. Sensors 2024, 24, 6899. [Google Scholar] [CrossRef]
- Liu, Y.; Zhang, Y.; Guo, Y.; Li, Y. Lightweight spatial-spectral shift module with multi-head MambaOut for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 18, 921–934. [Google Scholar] [CrossRef]
- Sun, M.; Wang, L.; Jiang, S.; Cheng, S.; Tang, L. HyperSMamba: A Lightweight Mamba for Efficient Hyperspectral Image Classification. Remote Sens. 2025, 17, 2008. [Google Scholar] [CrossRef]
- Liang, L.; Zhang, J.; Duan, P.; Kang, X.; Wu, T.X.; Li, J.; Plaza, A. LKMA: Learnable Kernel and Mamba with Spatial-Spectral Attention Fusion for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5530914. [Google Scholar] [CrossRef]
- Arya, R.K.; Jain, S.; Chattopadhyay, P.; Srivastava, R. HSIRMamba: An effective feature learning for hyperspectral image classification using residual Mamba. Image Vis. Comput. 2025, 154, 105387. [Google Scholar] [CrossRef]
- Paoletti, M.E.; Wu, Z.; Zheng, P.; Hong, D.; Haut, J.M. DenseMixerMamba: Residual Mixing for Spectral-Spatial Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5529919. [Google Scholar] [CrossRef]
- Wang, C.; Huang, J.; Lv, M.; Du, H.; Wu, Y.; Qin, R. A local enhanced mamba network for hyperspectral image classification. Int. J. Appl. Earth Obs. Geoinf. 2024, 133, 104092. [Google Scholar] [CrossRef]
- Zhang, J.; Sun, M.; Chang, S. Spatial and Spectral Structure-Aware Mamba Network for Hyperspectral Image Classification. Remote Sens. 2025, 17, 2489. [Google Scholar] [CrossRef]
- Ahmad, M.; Usama, M.; Mazzara, M.; Distefano, S. Wavemamba: Spatial-spectral wavelet mamba for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2024, 22, 5500505. [Google Scholar] [CrossRef]
- Ahmad, M.; Butt, M.H.F.; Khan, A.M.; Mazzara, M.; Distefano, S.; Usama, M.; Roy, S.K.; Chanussot, J.; Hong, D. Spatial–spectral morphological mamba for hyperspectral image classification. Neurocomputing 2025, 636, 129995. [Google Scholar] [CrossRef]
- Zhang, H.; Xu, X.; Li, S.; Plaza, A. Wavelet Decomposition-Based Spectral-Spatial Mamba Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5518817. [Google Scholar] [CrossRef]
- Zhuang, P.; Zhang, X.; Wang, H.; Zhang, T.; Liu, L.; Li, J. Fahm: Frequency-aware hierarchical mamba for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 6299–6313. [Google Scholar] [CrossRef]
- Zhu, M.; Wang, H.; Meng, Y.; Xu, S.; Lin, Y.; Shan, Z.; Ma, Z. Self-Supervised Mamba for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5531312. [Google Scholar] [CrossRef]
- Ding, H.; Liu, J.; Wang, Z.; Peng, Y.; Li, H. Mamba-Driven Multi-Scale Spatial-Spectral Fusion Network for Few-Shot Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 20742–20762. [Google Scholar] [CrossRef]
- Wang, Q.; Jiang, X.; Xu, G. CSFMamba: Cross State Fusion Mamba Operator for Multimodal Remote Sensing Image Classification. arXiv 2025, arXiv:2509.00677. [Google Scholar] [CrossRef]
- Xing, Y.; Jia, Y.; Gao, S.; Hu, J.; Huang, R. Frequency-enhanced mamba for remote sensing change detection. IEEE Geosci. Remote Sens. Lett. 2025, 22, 2501605. [Google Scholar] [CrossRef]
- Gao, F.; Jin, X.; Zhou, X.; Dong, J.; Du, Q. MSFMamba: Multi-scale feature fusion state space model for multi-source remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5504116. [Google Scholar]
- Li, S.; Huang, S. AFA–Mamba: Adaptive feature alignment with global–local mamba for hyperspectral and LiDAR data classification. Remote Sens. 2024, 16, 4050. [Google Scholar] [CrossRef]
- Pan, H.; Zhao, R.; Ge, H.; Liu, M.; Zhang, Q. Multi-Modal Fusion Mamba Network for Joint Land Cover Classification Using Hyperspectral and LiDAR Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 17328–17345. [Google Scholar] [CrossRef]
- Li, D.; Li, B.; Liu, Y. Mamba Cross-Modal Information Fusion Self-Distillation Model for Joint Classification of LiDAR and Hyperspectral Data. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5522013. [Google Scholar] [CrossRef]
- Shi, C.; Zhu, F.; Shi, K.; Wang, L.; Pan, H. TBi-Mamba: Rethinking Joint Classification of Hyperspectral and LiDAR Data with Bidirectional Mamba. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5522515. [Google Scholar] [CrossRef]
- Li, Z.; Wu, J.; Zhang, Y.; Yan, Y. CMFNet: Cross Mamba Fusion Network for Hyperspectral and LiDAR Data Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4418614. [Google Scholar] [CrossRef]
- Xie, Z.; Lv, L.; Gao, H.; Xu, S.; Xie, H. Dual-Feature Attention Hybrid GCN Mamba Network for Joint Hyperspectral and LiDAR Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5406514. [Google Scholar] [CrossRef]
- Cao, M.; Xie, W.; Zhang, X.; Zhang, J.; Jiang, K.; Lei, J.; Li, Y. M3amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 7605–7617. [Google Scholar] [CrossRef]
- Ye, F.; Tan, S.; Huang, W.; Xu, X.; Jiang, S. MambaTriNet: A Mamba based Tri-backbone multimodal remote sensing image semantic segmentation model. IEEE Geosci. Remote Sens. Lett. 2025, 22, 2503205. [Google Scholar] [CrossRef]
- Liao, D.; Wang, Q.; Lai, T.; Huang, H. Joint classification of hyperspectral and lidar data base on mamba. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5530915. [Google Scholar] [CrossRef]
- He, X.; Han, X.; Chen, Y.; Huang, L. A light-weighted fusion vision mamba for multimodal remote sensing data classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 21532–21548. [Google Scholar]
- Yue, Z.; Xu, J.; Yan, Y.; Su, M. TFFNet: Transform Fusion Fuzzy Network for Multimodal Remote Sensing Classification. IEEE Geosci. Remote Sens. Lett. 2025, 22, 5509505. [Google Scholar]
- Peng, S.; Zhu, X.; Deng, H.; Deng, L.J.; Lei, Z. Fusionmamba: Efficient remote sensing image fusion with state space model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5410216. [Google Scholar] [CrossRef]
- Wu, H.; Sun, Z.; Qi, J.; Zhan, T.; Xu, Y.; Wei, Z. Spatial-Spectral Cross Mamba Network for Hyperspectral and Multispectral Image Fusion. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5524113. [Google Scholar] [CrossRef]
- Zhao, G.; Wu, H.; Luo, D.; Ou, X.; Zhang, Y. Spatial spectral interaction super-resolution cnn-mamba network for fusion of satellite hyperspectral and multispectral image. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 18489–18501. [Google Scholar] [CrossRef]
- Zhang, Y.; Song, Y.; Duan, Q.; Yu, N.; Li, B.; Gao, X. S2CMamba: A Mamba-based Pan-sharpening Model Incorporating Spatial and Spectral Consistency. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5518013. [Google Scholar]
- Zhu, C.; Deng, S.; Song, X.; Li, Y.; Wang, Q. Mamba collaborative implicit neural representation for hyperspectral and multispectral remote sensing image fusion. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5504915. [Google Scholar] [CrossRef]
- Li, Z.; Wen, Y.; Xiao, S.; Qu, J.; Li, N.; Dong, W. A Progressive Registration-Fusion Co-Optimization A-Mamba Network: Towards Deep Unregistered Hyperspectral and Multispectral Fusion. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5514815. [Google Scholar] [CrossRef]
- Xiao, L.; Guo, S.; Mo, F.; Song, Q.; Yang, Y.; Liu, Y.; Wei, X.; Yang, T.; Dian, R. Spatial Invertible Network with Mamba-Convolution for Hyperspectral Image Fusion. IEEE J. Sel. Top. Signal Process. 2025. early access. [Google Scholar] [CrossRef]
- Zhao, M.; Jiang, X.; Huang, B. STFMamba: Spatiotemporal satellite image fusion network based on visual state space model. ISPRS J. Photogramm. Remote Sens. 2025, 228, 288–304. [Google Scholar] [CrossRef]
- Bioucas-Dias, J.M.; Plaza, A.; Dobigeon, N.; Parente, M.; Du, Q.; Gader, P.; Chanussot, J. Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 354–379. [Google Scholar]
- Zhang, M.; Xie, H.; Yang, M.; Jiao, Q.; Xu, L.; Tan, X. Mamba-Enhanced Spatial-Spectral Feature Learning for Hyperspectral Unmixing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 22798–22815. [Google Scholar] [CrossRef]
- Chen, D.; Zhang, J.; Li, J. UNMamba: Cascaded Spatial-Spectral Mamba for Blind Hyperspectral Unmixing. IEEE Geosci. Remote Sens. Lett. 2025, 22, 5502405. [Google Scholar] [CrossRef]
- Liu, Y.; Liu, S.; Wang, H. Efficient Progressive Mamba Model for Hyperspectral Sequence Unmixing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 19511–19526. [Google Scholar] [CrossRef]
- Gan, Y.; Wei, J.; Xu, M. Mamba-based spatial-spectral fusion network for hyperspectral unmixing. J. King Saud Univ. Comput. Inf. Sci. 2025, 37, 32. [Google Scholar] [CrossRef]
- Qu, K.; Wang, H.; Ding, M.; Luo, X.; Bao, W. DGMNet: Hyperspectral Unmixing Dual-Branch Network Integrating Adaptive Hop-Aware GCN and Neighborhood Offset Mamba. Remote Sens. 2025, 17, 2517. [Google Scholar] [CrossRef]
- Zheng, X.; Kuang, Y.; Huo, Y.; Zhu, W.; Zhang, M.; Wang, H. HTMNet: Hybrid Transformer–Mamba Network for Hyperspectral Target Detection. Remote Sens. 2025, 17, 3015. [Google Scholar] [CrossRef]
- Shen, D.; Zhu, X.; Tian, J.; Liu, J.; Du, Z.; Wang, H.; Ma, X. HTD-Mamba: Efficient Hyperspectral Target Detection with Pyramid State Space Model. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5507315. [Google Scholar] [CrossRef]
- Li, L.; Wang, B. DPMN: Deep Prior Mamba Network for Hyperspectral Anomaly Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5531516. [Google Scholar] [CrossRef]
- Fu, X.; Zhang, T.; Cheng, J.; Jia, S. MMR-HAD: Multi-scale Mamba Reconstruction Network for Hyperspectral Anomaly Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5516914. [Google Scholar] [CrossRef]
- Li, F.; Wang, X.; Wang, H.; Karimian, H.; Shi, J.; Zha, G. LMVMamba: A Hybrid U-Shape Mamba for Remote Sensing Segmentation with Adaptation Fine-Tuning. Remote Sens. 2025, 17, 3367. [Google Scholar] [CrossRef]
- Cao, Y.; Liu, C.; Wu, Z.; Zhang, L.; Yang, L. Remote sensing image segmentation using vision mamba and multi-scale multi-frequency feature fusion. Remote Sens. 2025, 17, 1390. [Google Scholar] [CrossRef]
- Zhu, E.; Chen, Z.; Wang, D.; Shi, H.; Liu, X.; Wang, L. Unetmamba: An efficient unet-like mamba for semantic segmentation of high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2024, 22, 6001205. [Google Scholar] [CrossRef]
- Du, W.L.; Gu, Y.; Zhao, J.; Zhu, H.; Yao, R.; Zhou, Y. A mamba-diffusion framework for multimodal remote sensing image semantic segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6016905. [Google Scholar] [CrossRef]
- Zhou, W.; Yang, P.; Liu, Y. HLMamba: Hybrid Lightweight Mamba-Based Fusion Network for Dense Prediction of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4414211. [Google Scholar] [CrossRef]
- Sun, H.; Liu, J.; Yang, J.; Wu, Z. HMAFNet: Hybrid Mamba-Attention Fusion Network for Remote Sensing Image Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2025, 22, 8001405. [Google Scholar] [CrossRef]
- Zheng, K.; Yu, M.; Liu, Z.; Bao, S.; Pan, Z.; Song, Y.; Zhu, L.; Xie, Z. Frequency and prompt learning cooperation enhanced mamba for remote sensing semantic segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025. early access. [Google Scholar] [CrossRef]
- Huang, P.; Zhang, K.; Ma, M.; Mei, S.; Wang, J. Semantic-Geometric Consistency-enforcing with Mamba-augmented Network for Remote Sensing Image Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 27814–27827. [Google Scholar] [CrossRef]
- Zhu, Q.; Cai, Y.; Fang, Y.; Yang, Y.; Chen, C.; Fan, L.; Nguyen, A. Samba: Semantic segmentation of remotely sensed images with state space model. Heliyon 2024, 10, e38495. [Google Scholar] [CrossRef]
- Mu, J.; Zhou, S.; Sun, X. PPMamba: Enhancing Semantic Segmentation in Remote Sensing Imagery by SS2D. IEEE Geosci. Remote Sens. Lett. 2024, 22, 6001705. [Google Scholar] [CrossRef]
- Li, M.; Xing, Z.; Wang, H.; Jiang, H.; Xie, Q. SF-Mamba: A Semantic-flow Foreground-aware Mamba for Semantic Segmentation of Remote Sensing Images. IEEE MultiMedia 2025, 32, 85–95. [Google Scholar] [CrossRef]
- Fang, X.; Liu, Z.; Xie, S.A.; Ge, Y. Semantic Segmentation of High-Resolution Remote Sensing Images Based on RS3Mamba: An Investigation of the Extraction Algorithm for Rural Compound Utilization Status. Remote Sens. 2025, 17, 3443. [Google Scholar] [CrossRef]
- Wen, R.; Yuan, Y.; Xu, X.; Yin, S.; Chen, Z.; Zeng, H.; Wang, Z. MambaSegNet: A Fast and Accurate High-Resolution Remote Sensing Imagery Ship Segmentation Network. Remote Sens. 2025, 17, 3328. [Google Scholar] [CrossRef]
- Yan, L.; Feng, Q.; Wang, J.; Cao, J.; Feng, X.; Tang, X. A multilevel multimodal hybrid mamba-large strip convolution network for remote sensing semantic segmentation. Remote Sens. 2025, 17, 2696. [Google Scholar] [CrossRef]
- Qiu, J.; Chang, W.; Ren, W.; Hou, S.; Yang, R. MMFNet: A Mamba-Based Multimodal Fusion Network for Remote Sensing Image Semantic Segmentation. Sensors 2025, 25, 6225. [Google Scholar] [CrossRef] [PubMed]
- Li, H.; Pan, H.; Liu, X.; Ren, J.; Du, Z.; Cao, J. GLVMamba: A Global-Local Visual State Space Model for Remote Sensing Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4412115. [Google Scholar] [CrossRef]
- Hu, Y.; Ma, X.; Sui, J.; Pun, M.O. Ppmamba: A pyramid pooling local auxiliary ssm-based model for remote sensing image semantic segmentation. arXiv 2024, arXiv:2409.06309. [Google Scholar] [CrossRef]
- Zhang, Q.; Geng, G.; Zhou, P.; Liu, Q.; Wang, Y.; Li, K. Link aggregation for skip connection–mamba: Remote sensing image segmentation network based on link aggregation mamba. Remote Sens. 2024, 16, 3622. [Google Scholar] [CrossRef]
- Ma, C.; Wang, Z. Semi-Mamba-UNet: Pixel-level contrastive and cross-supervised visual Mamba-based UNet for semi-supervised medical image segmentation. Knowl.-Based Syst. 2024, 300, 112203. [Google Scholar] [CrossRef]
- Zhu, Q.; Li, H.; He, L.; Fan, L. SwinMamba: A hybrid local-global mamba framework for enhancing semantic segmentation of remotely sensed images. arXiv 2025, arXiv:2509.20918. [Google Scholar]
- Wang, L.; Li, D.; Dong, S.; Meng, X.; Zhang, X.; Hong, D. PyramidMamba: Rethinking pyramid feature fusion with selective space state model for semantic segmentation of remote sensing imagery. arXiv 2024, arXiv:2406.10828. [Google Scholar] [CrossRef]
- Chen, H.; Luo, H.; Wang, C. AfaMamba: Adaptive Feature Aggregation with Visual State Space Model for Remote Sensing Images Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 8965–8983. [Google Scholar]
- Lin, B.; Zou, Z.; Shi, Z. RSBEV-Mamba: 3D BEV Sequence Modeling for Multi-View Remote Sensing Scene Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5613213. [Google Scholar]
- Li, L.; Yi, J.; Fan, H.; Lin, H. A Lightweight Semantic Segmentation Network Based on Self-attention Mechanism and State Space Model for Efficient Urban Scene Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4703215. [Google Scholar] [CrossRef]
- Yang, Y.; Yuan, G.; Li, J. Dual-Branch Network for Spatial-Channel Stream Modeling Based on the State Space Model for Remote Sensing Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5907719. [Google Scholar] [CrossRef]
- Zhao, Y.; Qiu, L.; Yang, Z.; Chen, Y.; Zhang, Y. MGF-GCN: Multimodal interaction Mamba-aided graph convolutional fusion network for semantic segmentation of remote sensing images. Inf. Fusion 2025, 122, 103150. [Google Scholar] [CrossRef]
- Du, W.L.; Tang, S.; Zhao, J.; Yao, R.; Zhou, Y. MoViM: A Hybrid CNN Vision Mamba Network for Lightweight Semantic Segmentation of Multimodal Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2025, 22, 6015305. [Google Scholar] [CrossRef]
- Wang, Z.; Xu, N.; You, Z.; Zhang, S. DiffMamba: Semantic diffusion guided feature modeling network for semantic segmentation of remote sensing images. GISci. Remote Sens. 2025, 62, 2484829. [Google Scholar] [CrossRef]
- Wang, Z.; Yi, J.; Chen, A.; Chen, L.; Lin, H.; Xu, K. Accurate semantic segmentation of very high-resolution remote sensing images considering feature state sequences: From benchmark datasets to urban applications. ISPRS J. Photogramm. Remote Sens. 2025, 220, 824–840. [Google Scholar] [CrossRef]
- Chai, X.; Zhang, W.; Li, Z.; Zhang, N.; Chai, X. AECA-FBMamba: A Framework with Adaptive Environment Channel Alignment and Mamba Bridging Semantics and Details. Remote Sens. 2025, 17, 1935. [Google Scholar]
- Li, D.; Zhao, J.; Chang, C.; Chen, Z.; Du, J. LGMamba: Large-Scale ALS Point Cloud Semantic Segmentation with Local and Global State Space Model. IEEE Geosci. Remote Sens. Lett. 2024, 22, 6500605. [Google Scholar] [CrossRef]
- Zhou, M.; Li, T.; Qiao, C.; Xie, D.; Wang, G.; Ruan, N.; Mei, L.; Yang, Y.; Shen, H.T. Dmm: Disparity-guided multispectral mamba for oriented object detection in remote sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5404913. [Google Scholar] [CrossRef]
- Wang, S.; Wang, C.; Shi, C.; Liu, Y.; Lu, M. Mask-guided mamba fusion for drone-based visible-infrared vehicle detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5005712. [Google Scholar] [CrossRef]
- Liu, C.; Ma, X.; Yang, X.; Zhang, Y.; Dong, Y. COMO: Cross-mamba interaction and offset-guided fusion for multimodal object detection. Inf. Fusion 2026, 125, 103414. [Google Scholar]
- Ren, K.; Wu, X.; Xu, L.; Wang, L. Remotedet-mamba: A hybrid mamba-cnn network for multi-modal object detection in remote sensing images. arXiv 2024, arXiv:2410.13532. [Google Scholar]
- Li, W.; Yuan, F.; Zhang, H.; Lv, Z.; Wu, B. Hyperspectral object detection based on spatial–spectral fusion and visual mamba. Remote Sens. 2024, 16, 4482. [Google Scholar] [CrossRef]
- Rong, Q.; Jing, H.; Zhang, M. Scale Sensitivity Mamba Network for Object Detection in Remote Sensing Images. IEEE Sens. J. 2025, 25, 43339–43351. [Google Scholar] [CrossRef]
- Wu, S.; Lu, X.; Guo, C. YOLOv5_mamba: Unmanned aerial vehicle object detection based on bidirectional dense feedback network and adaptive gate feature fusion. Sci. Rep. 2024, 14, 22396. [Google Scholar] [CrossRef]
- Wu, S.; Lu, X.; Guo, C.; Guo, H. MV-YOLO: An Efficient Small Object Detection Framework Based on Mamba. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5632814. [Google Scholar] [CrossRef]
- Verma, T.; Singh, J.; Bhartari, Y.; Jarwal, R.; Singh, S.; Singh, S. Soar: Advancements in small body object detection for aerial imagery using state space models and programmable gradients. arXiv 2024, arXiv:2405.01699. [Google Scholar] [CrossRef]
- Xiao, Z.; Li, Z.; Cao, J.; Liu, X.; Kong, Y.; Du, Z. OriMamba: Remote sensing oriented object detection with state space models. Int. J. Appl. Earth Obs. Geoinf. 2025, 143, 104731. [Google Scholar] [CrossRef]
- Chen, J.; Wei, J.; Wu, G.; Yang, J.; Shang, J.; Guo, H.; Zhang, D.; Zhu, S. MambaRetinaNet: Improving remote sensing object detection by fusing Mamba and multi-scale convolution. Appl. Comput. Geosci. 2025, 28, 100305. [Google Scholar] [CrossRef]
- Tian, B.; Lu, Z.; Zhang, C.; Li, H.; Yu, P. MSMD-YOLO: Multi-scale and multi-directional Mamba scanning infrared image object detection based on YOLO. Infrared Phys. Technol. 2025, 150, 106011. [Google Scholar] [CrossRef]
- Yan, L.; He, Z.; Zhang, Z.; Xie, G. LS-MambaNet: Integrating Large Strip Convolution and Mamba Network for Remote Sensing Object Detection. Remote Sens. 2025, 17, 1721. [Google Scholar] [CrossRef]
- Tu, H.; Wang, W.; Guo, Y.; Chen, S. Mamba-UDA: Mamba Unsupervised Domain Adaptation for SAR Ship Detection. IEEE Geosci. Remote Sens. Lett. 2025, 22, 4011205. [Google Scholar] [CrossRef]
- Liu, X.; Feng, C.; Zi, S.; Qin, Z.; Guan, Q. M-ReDet: A mamba-based method for remote sensing ship object detection and fine-grained recognition. PLoS ONE 2025, 20, e0330485. [Google Scholar] [CrossRef]
- Liu, P.; Lei, S.; Li, H.C. Mamba-MOC: A Multicategory Remote Object Counting via State Space Model. arXiv 2025, arXiv:2501.06697. [Google Scholar] [CrossRef]
- Wang, Q.; Zhou, L.; Jin, P.; Qu, X.; Zhong, H.; Song, H.; Shen, T. TrackingMamba: Visual state space model for object tracking. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 16744–16754. [Google Scholar] [CrossRef]
- Jiang, J.; Liao, S.; Yang, X.; Shen, K. EAMNet: Efficient Adaptive Mamba Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5008517. [Google Scholar] [CrossRef]
- Li, B.; Rao, P.; Su, Y.; Chen, X. HMCNet: A Hybrid Mamba–CNN UNet for Infrared Small Target Detection. Remote Sens. 2025, 17, 452. [Google Scholar] [CrossRef]
- Yu, Z.; Zhang, Z.; Tian, H.; Zhou, Q.; Zhang, H. SBMambaNet: Spatial-BiDirectional Mamba Network for infrared small target detection. Infrared Phys. Technol. 2025, 150, 105928. [Google Scholar] [CrossRef]
- Ge, Y.; Liang, T.; Ren, J.; Chen, J.; Bi, H. Enhanced salient object detection in remote sensing images via dual-stream semantic interactive network. Vis. Comput. 2025, 41, 5153–5169. [Google Scholar] [CrossRef]
- Yang, W.; Yi, Z.; Huang, A.; Wang, Y.; Yao, Y.; Li, Y. Topology-Aware Hierarchical Mamba for Salient Object Detection in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5646316. [Google Scholar] [CrossRef]
- Li, J.; Wang, Z.; Xu, N.; Zhang, C. TSFANet: Trans-Mamba Hybrid Network with Semantic Feature Alignment for Remote Sensing Salient Object Detection. Remote Sens. 2025, 17, 1902. [Google Scholar] [CrossRef]
- Xing, G.; Wang, M.; Wang, F.; Sun, F.; Li, H. Lightweight edge-aware mamba-fusion network for weakly supervised salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5631813. [Google Scholar] [CrossRef]
- Li, Y.; Wang, L.; Chen, S. SMILE: Spatial-Spectral Mamba Interactive Learning for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5005214. [Google Scholar] [CrossRef]
- Chen, T.; Ye, Z.; Tan, Z.; Gong, T.; Wu, Y.; Chu, Q.; Liu, B.; Yu, N.; Ye, J. Mim-istd: Mamba-in-mamba for efficient infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5007613. [Google Scholar] [CrossRef]
- Yu, C.; Yang, H.; Ma, L.; Yang, J.; Jin, Y.; Zhang, W.; Wang, K.; Zhao, Q. Deep Learning-Based Change Detection in Remote Sensing: A Comprehensive Review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 24415–24437. [Google Scholar] [CrossRef]
- Chen, H.; Song, J.; Han, C.; Xia, J.; Yokoya, N. ChangeMamba: Remote sensing change detection with spatiotemporal state space model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4409720. [Google Scholar] [CrossRef]
- Liu, S.; Wang, S.; Zhang, W.; Zhang, T.; Xu, M.; Yasir, M.; Wei, S. CD-STMamba: Towards Remote Sensing Image Change Detection with Spatio-Temporal Interaction Mamba Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 10471–10485. [Google Scholar] [CrossRef]
- Wu, Z.; Ma, X.; Lian, R.; Zheng, K.; Ma, M.; Zhang, W.; Song, S. CD-lamba: Boosting remote sensing change detection via a cross-temporal locally adaptive state space model. arXiv 2025, arXiv:2501.15455. [Google Scholar] [CrossRef]
- Kaung, J.; Ge, H. 2DMCG: 2DMamba with Change Flow Guidance for Change Detection in Remote Sensing. arXiv 2025, arXiv:2503.00521. [Google Scholar]
- Xu, Z.; Zhu, Y.; Dewis, Z.; Heffring, M.; Alkayid, M.; Taleghanidoozdoozan, S.; Xu, L.L. Knowledge-Aware Mamba for Joint Change Detection and Classification from MODIS Time Series. arXiv 2025, arXiv:2510.09679. [Google Scholar]
- Zhao, J.; Xie, J.; Zhou, Y.; Du, W.L.; Yao, R.; El Saddik, A. ST-Mamba: Spatio-Temporal Synergistic Model for Remote Sensing Change Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4412413. [Google Scholar]
- Zhou, S.; Xu, C.; Fan, G.; Li, J.; Hua, Z.; Zhou, J. Sprmamba: A mamba-based saliency proportion reconciliatory network with squeezed windows for remote sensing change detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4705516. [Google Scholar] [CrossRef]
- Xu, G.; Liu, Y.; Deng, L.; Wang, X.; Zhu, H. Smnet: A semantic guided mamba network for remote sensing change detection. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 11116–11127. [Google Scholar] [CrossRef]
- Wang, L.; Sun, Q.; Pei, J.; Khan, M.A.; Al Dabel, M.M.; Al-Otaibi, Y.D.; Bashir, A.K. Bi-Temporal Remote Sensing Change Detection with State Space Models. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 14942–14954. [Google Scholar] [CrossRef]
- Zhang, H.; Chen, K.; Liu, C.; Chen, H.; Zou, Z.; Shi, Z. CDMamba: Incorporating local clues into mamba for remote sensing image binary change detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4405016. [Google Scholar] [CrossRef]
- Liu, Y.; Cheng, G.; Sun, Q.; Tian, C.; Wang, L. CWmamba: Leveraging CNN-Mamba fusion for enhanced change detection in remote sensing images. IEEE Geosci. Remote Sens. Lett. 2025, 22, 2501505. [Google Scholar] [CrossRef]
- Feng, Y.; Zhuo, L.; Zhang, H.; Li, J. Hybrid-MambaCD: Hybrid Mamba-CNN Network for Remote Sensing Image Change Detection with Region-Channel Attention Mechanism and Iterative Global-Local Feature Fusion. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5907912. [Google Scholar] [CrossRef]
- Dong, Z.; Yuan, G.; Hua, Z.; Li, J. ConMamba: CNN and SSM high-performance hybrid network for remote sensing change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5935115. [Google Scholar] [CrossRef]
- Wang, J.; Song, J.; Zhang, H.; Zhang, Z.; Ji, Y.; Zhang, W.; Zhang, J.; Wang, X. SPMNet: A Siamese Pyramid Mamba Network for Very-High-Resolution Remote Sensing Change Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4410314. [Google Scholar] [CrossRef]
- Huang, J.; Yuan, X.; Lam, C.T.; Wang, Y.; Xia, M. LCCDMamba: Visual state space model for land cover change detection of VHR remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 5765–5781. [Google Scholar] [CrossRef]
- Zhang, Z.; Fan, X.; Wang, X.; Qin, Y.; Xia, J. A novel remote sensing image change detection approach based on multi-level state space model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4417014. [Google Scholar] [CrossRef]
- Chen, Z.; Chen, H.; Leng, J.; Zhang, X.; Gao, Q.; Dong, W. VMMCD: VMamba-Based Multi-Scale Feature Guiding Fusion Network for Remote Sensing Change Detection. Remote Sens. 2025, 17, 1840. [Google Scholar] [CrossRef]
- Wang, S.; Cheng, D.; Yuan, G.; Li, J. RDSF-Net: Residual Wavelet Mamba-Based Differential Completion and Spatio-Frequency Extraction Remote Sensing Change Detection Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 11573–11587. [Google Scholar] [CrossRef]
- Wang, S.; Yuan, G.; Li, J. GSSR-Net: Geo-Spatial Structural Refinement Network for Remote Sensing Change Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5909715. [Google Scholar] [CrossRef]
- Song, Z.; Wu, Y.; Huang, S. Mamba-MSCCA-Net: Efficient change detection for remote sensing images. Displays 2025, 90, 103097. [Google Scholar] [CrossRef]
- Guo, Y.; Xu, Y.; Tang, G.; Yu, Z.; Tang, Q. AM-CD: Joint Attention and Mamba for Remote Sensing Image Change Detection. Neurocomputing 2025, 647, 130607. [Google Scholar] [CrossRef]
- Wang, H.; Ye, Z.; Xu, C.; Mei, L.; Lei, C.; Wang, D. TTMGNet: Tree Topology Mamba-Guided Network Collaborative Hierarchical Incremental Aggregation for Change Detection. Remote Sens. 2024, 16, 4068. [Google Scholar] [CrossRef]
- Song, J.; Yang, S.; Li, Y.; Li, X. An Unsupervised Remote Sensing Image Change Detection Method Based on RVMamba and Posterior Probability Space Change Vector. Remote Sens. 2024, 16, 4656. [Google Scholar] [CrossRef]
- Ma, J.; Li, B.; Li, H.; Meng, S.; Lu, R.; Mei, S. Remote Sensing Change Detection by Pyramid Sequential Processing with Mamba. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 19481–19495. [Google Scholar] [CrossRef]
- Liu, F.; Wen, Y.; Sun, J.; Zhu, P.; Mao, L.; Niu, G.; Li, J. Iterative Mamba Diffusion Change-Detection Model for Remote Sensing. Remote Sens. 2024, 16, 3651. [Google Scholar] [CrossRef]
- Sun, M.; Guo, F. DC-Mamba: Bi-temporal deformable alignment and scale-sparse enhancement for remote sensing change detection. arXiv 2025, arXiv:2509.15563. [Google Scholar]
- Huang, Z.; Duan, P.; Yuan, G.; Li, J. MSA: Mamba Semantic Alignment Networks for Remote Sensing Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 10625–10639. [Google Scholar] [CrossRef]
- Li, Y.; Liu, W.; Li, E.; Zhang, L.; Li, X. Sam-mamba: A two-stage change detection network combining the adapting segment anything and mamba models. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 21607–21619. [Google Scholar] [CrossRef]
- Qin, Y.; Wang, C.; Fan, Y.; Pan, C. SAM2-CD: Remote Sensing Image Change Detection with SAM2. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 24575–24587. [Google Scholar]
- Zhang, J.; Chen, R.; Liu, F.; Liu, H.; Zheng, B.; Hu, C. DC-Mamba: A novel network for enhanced remote sensing change detection in difficult cases. Remote Sens. 2024, 16, 4186. [Google Scholar] [CrossRef]
- Chen, D.; Liang, X.; Wang, L.; Guo, Q.; Zhang, J. Global Difference-Aware Mamba for Hyperspectral Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5523214. [Google Scholar] [CrossRef]
- Ding, C.; Hao, X.; Zheng, S.; Dong, Y.; Hua, W.; Wei, W.; Zhang, L.; Zhang, Y. A Wavelet-Augmented Dual-Branch Position-Embedding Mamba Network for Hyperspectral Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5523918. [Google Scholar] [CrossRef]
- Zhan, T.; Qi, J.; Zhang, J.; Yu, X.; Du, Q.; Wu, Z. Spatial-Spectral Feature-Enhanced Mamba and SAM-Guided Hyperspectral Multi-class Change Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5541113. [Google Scholar] [CrossRef]
- Fu, Y.; Wu, Z.; Zheng, Z.; Zhu, Q.; Gu, Y.; Kwan, M.P. Mamba-LCD: Robust Urban Change Detection in Low-Light Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 21200–21212. [Google Scholar] [CrossRef]
- Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. Rsmamba: Remote sensing image classification with state space model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 8002605. [Google Scholar] [CrossRef]
- Yang, M.; Chen, L. HC-Mamba: Remote Sensing Image Classification via Hybrid Cross-activation State Space Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 10429–10441. [Google Scholar] [CrossRef]
- Yan, L.; Zhang, X.; Wang, K.; Zhang, D. Contour-enhanced visual state-space model for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5603614. [Google Scholar] [CrossRef]
- Li, D.; Liu, R.; Liu, Y. MPFASS-Net: A Mamba Progressive Feature Aggregation Network with Self-Supervised for Remote Sensing Image Scene Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5523614. [Google Scholar] [CrossRef]
- Roy, S.; Sar, A.; Kaushish, A.; Choudhury, T.; Um, J.S.; Israr, M.; Mohanty, S.N.; Abraham, A. HSS-KAMNet: A Hybrid Spectral-Spatial Kolmogorov-Arnold Mamba Network for Residential Land Cover Identification on RS Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 29379–29398. [Google Scholar] [CrossRef]
- Kuang, Z.; Bi, H.; Li, F.; Xu, C. ECP-Mamba: An Efficient Multi-scale Self-supervised Contrastive Learning Method with State Space Model for PolSAR Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5218718. [Google Scholar] [CrossRef]
- Du, R.; Tang, X.; Ma, J.; Zhang, X.; Jiao, L. MLMamba: A Mamba-based Efficient Network for Multi-label Remote Sensing Scene Classification. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 6245–6258. [Google Scholar] [CrossRef]
- Jiang, K.; Yang, M.; Xiao, Y.; Wu, J.; Wang, G.; Feng, X.; Jiang, J. Rep-Mamba: Re-Parameterization in Vision Mamba for Lightweight Remote Sensing Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5637012. [Google Scholar] [CrossRef]
- Liu, Y.; Zhang, R.; Fu, W.; Chen, J.; Dai, A. CM2-Net: A Hybrid CNN-Mamba2 Net for 3D Electromagnetic Tomography image reconstruction. IEEE Sens. J. 2025, 25, 39933–39943. [Google Scholar] [CrossRef]
- Zhou, H.; Wu, X.; Chen, H.; Chen, X.; He, X. Rsdehamba: Lightweight vision mamba for remote sensing satellite image dehazing. arXiv 2024, arXiv:2405.10030. [Google Scholar] [CrossRef]
- Chi, K.; Guo, S.; Chu, J.; Li, Q.; Wang, Q. Rsmamba: Biologically plausible retinex-based mamba for remote sensing shadow removal. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5606310. [Google Scholar] [CrossRef]
- Dong, J.; Yin, H.; Li, H.; Li, W.; Zhang, Y.; Khan, S.; Khan, F.S. Dual hyperspectral mamba for efficient spectral compressive imaging. arXiv 2024, arXiv:2406.00449. [Google Scholar] [CrossRef]
- Zhang, C.; Wang, F.; Zhang, X.; Wang, M.; Wu, X.; Dang, S. Mamba-CR: A state-space model for remote sensing image cloud removal. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5601913. [Google Scholar] [CrossRef]
- Liu, J.; Pan, B.; Shi, Z. CR-Famba: A frequency-domain assisted mamba for thin cloud removal in optical remote sensing imagery. IEEE Trans. Multimed. 2025, 27, 5659–5668. [Google Scholar] [CrossRef]
- Wu, T.; Zhao, R.; Lv, M.; Jia, Z.; Li, L.; Liu, M.; Zhao, X.; Ma, H.; Vivone, G. Efficient Mamba-Attention Network for Remote Sensing Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5627814. [Google Scholar] [CrossRef]
- Liu, W.; Luo, B.; Liu, J.; Nie, H.; Su, X. FEMNet: A Feature-Enriched Mamba Network for Cloud Detection in Remote Sensing Imagery. Remote Sens. 2025, 17, 2639. [Google Scholar] [CrossRef]
- Huang, Y.; Miyazaki, T.; Liu, X.; Omachi, S. Irsrmamba: Infrared image super-resolution via mamba-based wavelet transform feature modulation model. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5005416. [Google Scholar] [CrossRef]
- Weng, M.; Liu, J.; Yang, J.; Wu, Z.; Xiao, L. Range-Null Space Decomposition with Frequency-Oriented Mamba for Spectral Super-Resolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 10292–10306. [Google Scholar] [CrossRef]
- Meng, S.; Gong, W.; Li, S.; Song, G.; Yang, J.; Ding, Y. CDWMamba: Cloud Detection with Wavelet-Enhanced Mamba for Optical Satellite Imagery. Remote Sens. 2025, 17, 1874. [Google Scholar] [CrossRef]
- Li, M.; Xiong, C.; Gao, Z.; Ma, J. HAM: Hierarchical Attention Mamba with Spatial-Frequency Fusion for Remote Sensing Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5641314. [Google Scholar] [CrossRef]
- Wang, Y.; Li, Y.; Yang, X.; Jiang, R.; Zhang, L. HDAMNet: Hierarchical Dilated Adaptive Mamba Network for Accurate Cloud Detection in Satellite Imagery. Remote Sens. 2025, 17, 2992. [Google Scholar] [CrossRef]
- Zhi, R.; Fan, X.; Shi, J. MambaFormerSR: A lightweight model for remote-sensing image super-resolution. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6015705. [Google Scholar] [CrossRef]
- Xue, T.; Zhao, J.; Li, J.; Chen, C.; Zhan, K. CD-Mamba: Cloud detection with long-range spatial dependency modeling. J. Appl. Remote Sens. 2025, 19, 038507. [Google Scholar] [CrossRef]
- Xu, Y.; Wang, H.; Zhou, F.; Luo, C.; Sun, X.; Rahardja, S.; Ren, P. MambaHSISR: Mamba hyperspectral image super-resolution. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5511216. [Google Scholar] [CrossRef]
- Zhu, Q.; Zhang, G.; Zou, X.; Wang, X.; Huang, J.; Li, X. Convmambasr: Leveraging state-space models and cnns in a dual-branch architecture for remote sensing imagery super-resolution. Remote Sens. 2024, 16, 3254. [Google Scholar] [CrossRef]
- Chu, J.; Chi, K.; Wang, Q. RMMamba: Randomized Mamba for Remote Sensing Shadow Removal. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5634810. [Google Scholar] [CrossRef]
- Sui, T.; Xiang, G.; Chen, F.; Li, Y.; Tao, X.; Zhou, J.; Hong, J.; Qiu, Z. U-Shaped Dual Attention Vision Mamba Network for Satellite Remote Sensing Single-Image Dehazing. Remote Sens. 2025, 17, 1055. [Google Scholar] [CrossRef]
- Zhao, Z.; Gao, Q.; Yan, J.; Li, C.; Tang, J. HSFMamba: Hierarchical selective fusion Mamba network for optics-guided joint super-resolution and denoising of noise-corrupted SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 16445–16461. [Google Scholar] [CrossRef]
- Duan, P.; Luo, Y.; Kang, X.; Li, S. LaMamba: Linear Attention Mamba for Hyperspectral Image Denoising. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5527113. [Google Scholar] [CrossRef]
- Xie, Z.; Miao, G.; Chang, H. MTSR: Mamba-Transformer Super-Resolution Model for Hyperspectral Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 23256–23272. [Google Scholar] [CrossRef]
- Xin, X.; Deng, Y.; Huang, W.; Wu, Y.; Fang, J.; Wang, J. Multi-Pattern Scanning Mamba for Cloud Removal. Remote Sens. 2025, 17, 3593. [Google Scholar] [CrossRef]
- Li, C.; Pan, Z.; Hong, D. Dynamic State-Control Modeling for Generalized Remote Sensing Image Super-Resolution. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 3076–3084. [Google Scholar]
- Si, P.; Jia, M.; Wang, H.; Wang, J.; Sun, L.; Fu, Z. DC-Mamba: A Degradation-Aware Cross-Modality Framework for Blind Super-Resolution of Thermal UAV Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5005815. [Google Scholar] [CrossRef]
- Wu, S.; He, X.; Chen, X. Weamba: Weather-Degraded Remote Sensing Image Restoration with Multi-Router State Space Model. Remote Sens. 2025, 17, 458. [Google Scholar] [CrossRef]
- Deng, N.; Han, J.; Ding, H.; Liu, D.; Zhang, Z.; Song, W.; Tong, X. OSSMDNet: An Omni-Selective Scanning Mechanism for a Remote Sensing Image Denoising Network Based on the State-Space Model. Remote Sens. 2025, 17, 2759. [Google Scholar] [CrossRef]
- Zhu, Z.; Chen, Y.; Zhang, S.; Luo, G.; Zeng, J. Mamba-Based Unet for Hyperspectral Image Denoising. IEEE Signal Process. Lett. 2025, 32, 1411–1415. [Google Scholar] [CrossRef]
- Chen, C.; Li, J.; Liu, X.; Yuan, Q.; Zhang, L. Bidirectional-Aware Network Combining Transformer and Mamba for Hyperspectral Image Denoising. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5514316. [Google Scholar] [CrossRef]
- Fu, H.; Sun, G.; Li, Y.; Ren, J.; Zhang, A.; Jing, C.; Ghamisi, P. HDMba: Hyperspectral remote sensing imagery dehazing with state space model. arXiv 2024, arXiv:2406.05700. [Google Scholar] [CrossRef]
- Shao, M.; Tan, X.; Shang, K.; Liu, T.; Cao, X. A Hybrid Model of State Space Model and Attention for Hyperspectral Image Denoising. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 9904–9918. [Google Scholar] [CrossRef]
- Liu, Y.; Xiao, J.; Song, X.; Guo, Y.; Jiang, P.; Yang, H.; Wang, F. HSIDMamba: Exploring bidirectional state-space models for hyperspectral denoising. arXiv 2024, arXiv:2404.09697. [Google Scholar] [CrossRef]
- Luan, X.; Fan, H.; Wang, Q.; Yang, N.; Liu, S.; Li, X.; Tang, Y. FMambaIR: A hybrid state space model and frequency domain for image restoration. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4201614. [Google Scholar] [CrossRef]
- Qiu, L.; Xie, F.; Liu, C.; Che, X.; Shi, Z. Radiation-Tolerant Unsupervised Deep Image Stitching for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5642121. [Google Scholar] [CrossRef]
- Yang, M.; Jiang, S.; Jiang, W.; Li, Q. Mamba-based Feature Extraction and Multi-Frequency Information Fusion for Stereo Matching of High-Resolution Satellite Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 23273–23288. [Google Scholar] [CrossRef]
- Li, B.; Zhao, H.; Wang, W.; Hu, P.; Gou, Y.; Peng, X. Mair: A locality-and continuity-preserving mamba for image restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 7491–7501. [Google Scholar]
- Fu, G.; Xiong, F.; Lu, J.; Zhou, J. SSUMamba: Spatial-spectral selective state space model for hyperspectral image denoising. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5527714. [Google Scholar] [CrossRef]
- Patnaik, N.; Nayak, N.; Agrawal, H.B.; Khamaru, M.C.; Bal, G.; Panda, S.S.; Raj, R.; Meena, V.; Vadlamani, K. Small Vision-Language Models: A Survey on Compact Architectures and Techniques. arXiv 2025, arXiv:2503.10665. [Google Scholar]
- Li, S.; Tang, H. Multimodal alignment and fusion: A survey. arXiv 2024, arXiv:2411.17040. [Google Scholar] [CrossRef]
- Meng, L.; Wang, J.; Huang, Y.; Xiao, L. RSIC-GMamba: A State Space Model with Genetic Operations for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4702216. [Google Scholar] [CrossRef]
- Liu, C.; Chen, K.; Chen, B.; Zhang, H.; Zou, Z.; Shi, Z. Rscama: Remote sensing image change captioning with state space model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6010405. [Google Scholar] [CrossRef]
- Liu, C.; Zhang, J.; Chen, K.; Wang, M.; Zou, Z.; Shi, Z. Remote Sensing Spatiotemporal Vision–Language Models: A Comprehensive Survey. IEEE Geosci. Remote Sens. Mag. 2025, early access. [Google Scholar] [CrossRef]
- Chen, K.; Liu, C.; Chen, B.; Li, W.; Zou, Z.; Shi, Z. Dynamicvis: An efficient and general visual foundation model for remote sensing image understanding. arXiv 2025, arXiv:2503.16426. [Google Scholar] [CrossRef]
- He, X.; Cao, K.; Zhang, J.; Yan, K.; Wang, Y.; Li, R.; Xie, C.; Hong, D.; Zhou, M. Pan-mamba: Effective pan-sharpening with state space model. Inf. Fusion 2025, 115, 102779. [Google Scholar] [CrossRef]
- Wang, Y.; Liang, F.; Wang, S.; Chen, H.; Cao, Q.; Fu, H.; Chen, Z. Towards an Efficient Remote Sensing Image Compression Network with Visual State Space Model. Remote Sens. 2025, 17, 425. [Google Scholar] [CrossRef]
- Fei, Z.; Fan, M.; Yu, C.; Li, D.; Zhang, Y.; Huang, J. Dimba: Transformer-mamba diffusion models. arXiv 2024, arXiv:2406.01159. [Google Scholar] [CrossRef]
- Peng, X.; Zhou, J.; Wu, X. Distillation-Based Cross-Model Transferable Adversarial Attack for Remote Sensing Image Classification. Remote Sens. 2025, 17, 1700. [Google Scholar] [CrossRef]
- Dewis, Z.; Xu, Z.; Zhu, Y.; Alkayid, M.; Heffring, M.; Xu, L.L. Spatial-Temporal-Spectral Mamba with Sparse Deformable Token Sequence for Enhanced MODIS Time Series Classification. arXiv 2025, arXiv:2508.02839. [Google Scholar] [CrossRef]
- Li, D.; Bhatti, U.A. MSTFNet: A Mamba and Dual Swin-Transformer Fusion Network for Remote Sensing Image Classification for Precision Agriculture Land Processing. 2024. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5033170 (accessed on 19 November 2025).
- Zhao, M.; Wang, D.; Zhang, G.; Cao, W.; Xu, S.; Li, Z.; Liu, X. Evaluating Maize Emergence Quality with Multi-task YOLO11-Mamba and UAV-RGB Remote Sensing. Smart Agric. Technol. 2025, 12, 101351. [Google Scholar] [CrossRef]
- Li, J.; Yang, C.; Zhu, C.; Qin, T.; Tu, J.; Wang, B.; Yao, J.; Qiao, J. CMRNet: An Automatic Rapeseed Counting and Localization Method Based on the CNN-Mamba Hybrid Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 19051–19065. [Google Scholar] [CrossRef]
- Li, H.; Zhao, F.; Xue, F.; Wang, J.; Liu, Y.; Chen, Y.; Wu, Q.; Tao, J.; Zhang, G.; Xi, D.; et al. Succulent-YOLO: Smart UAV-Assisted Succulent Farmland Monitoring with CLIP-Based YOLOv10 and Mamba Computer Vision. Remote Sens. 2025, 17, 2219. [Google Scholar] [CrossRef]
- Zhang, X.; Gu, J.; Azam, B.; Zhang, W.; Lin, M.; Li, C.; Jing, W.; Akhtar, N. RSVMamba for Tree Species Classification Using UAV RGB Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5607716. [Google Scholar] [CrossRef]
- Zheng, J.; Fu, Y.; Chen, X.; Zhao, R.; Lu, J.; Zhao, H.; Chen, Q. EGCM-UNet: Edge Guided Hybrid CNN-Mamba UNet for farmland remote sensing image semantic segmentation. Geocarto Int. 2025, 40, 2440407. [Google Scholar] [CrossRef]
- Park, J.; Kim, H.S.; Ko, K.; Kim, M.; Kim, C. VideoMamba: Spatio-temporal selective state space model. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 1–18. [Google Scholar]
- Li, Y.; Wang, Y.; Shao, X.; Zheng, A. An efficient fire detection algorithm based on Mamba space state linear attention. Sci. Rep. 2025, 15, 11289. [Google Scholar] [CrossRef]
- Ho, Y.H.; Mostafavi, A. Multimodal Mamba with multitask learning for building flood damage assessment using synthetic aperture radar remote sensing imagery. Comput.-Aided Civ. Infrastruct. Eng. 2025, 40, 4401–4424. [Google Scholar] [CrossRef]
- Ho, Y.H.; Mostafavi, A. Flood-DamageSense: Multimodal Mamba with Multitask Learning for Building Flood Damage Assessment using SAR Remote Sensing Imagery. arXiv 2025, arXiv:2506.06667. [Google Scholar]
- Tang, X.; Lu, Z.; Fan, X.; Yan, X.; Yuan, X.; Li, D.; Li, H.; Li, H.; Meena, S.R.; Novellino, A.; et al. Mamba for landslide detection: A lightweight model for mapping landslides with very high-resolution images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5637117. [Google Scholar] [CrossRef]
- Shao, Y.; Xu, L. Multimodal Natural Disaster Scene Recognition with Integrated Large Model and Mamba. Appl. Sci. 2025, 15, 1149. [Google Scholar] [CrossRef]
- Andrianarivony, H.S.; Akhloufi, M.A. LinU-Mamba: Visual Mamba U-Net with Linear Attention to Predict Wildfire Spread. Remote Sens. 2025, 17, 2715. [Google Scholar] [CrossRef]
- Li, W.; Ma, G.; Zhang, H.; Chen, P.; Wang, D.; Chen, R. Multi-scenario building change detection in remote sensing images using CNN-Mamba hybrid network and consistency enhancement learning. Expert Syst. Appl. 2025, 298, 129843. [Google Scholar] [CrossRef]
- Chen, S.; Wang, F.; Ren, P.; Luo, C.; Fu, Z. OSDMamba: Enhancing Oil Spill Detection from Remote Sensing Images Using Selective State Space Model. arXiv 2025, arXiv:2506.18006. [Google Scholar] [CrossRef]
- Zhang, Y.; Wang, S.; Chen, Y.; Wei, S.; Xu, M.; Liu, S. Algae-Mamba: A Spatially Variable Mamba for Algae Extraction from Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 14324–14337. [Google Scholar] [CrossRef]
- Zhang, X.; Ma, Y.; Zhang, F.; Li, Z.; Zhang, J. Multi-Model Synergistic Satellite-Derived Bathymetry Fusion Approach Based on Mamba Coral Reef Habitat Classification. Remote Sens. 2025, 17, 2134.
- Sha, P.; Lu, S.; Xu, Z.; Yu, J.; Li, L.; Zou, Y.; Zhao, L. OWTDNet: A Novel CNN-Mamba Fusion Network for Offshore Wind Turbine Detection in High-Resolution Remote Sensing Images. J. Mar. Sci. Eng. 2025, 13, 2124.
- Jiang, X.; Wang, S.; Li, W.; Yang, H.; Guan, J.; Zhang, Y.; Zhou, S. STDMamba: Spatio-Temporal Decomposition Mamba for Long-Term Fine-Grained SST Prediction. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4212616.
- Shi, X.; Ni, W.; Duan, B.; Su, Q.; Liu, L.; Ren, K. MMamba: An Efficient Multimodal Framework for Real-Time Ocean Surface Wind Speed Inpainting Using Mutual Information and Attention-Mamba-2. Remote Sens. 2025, 17, 3091.
- Sun, Y.; Song, J.; Cai, Z.; Xiao, L. Tracking Mamba for Road Extraction From Satellite Imagery. IEEE Geosci. Remote Sens. Lett. 2025, 22, 6014305.
- Wang, Z.; Yuan, S.; Li, R.; Xu, N.; You, Z.; Huang, D. FDMamba: Frequency-Driven Dual-Branch Mamba Network for Road Extraction From Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5643419.
- Li, B.; Shen, C.; Gu, S.; Zhao, Y.; Xiao, F. Explicitly Integrated Multi-Task Learning in a Hybrid Network for Remote Sensing Road Extraction. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 21186–21199.
- Zhao, S.; Wang, F.; Huang, X.; Yang, X.; Jiang, N.; Peng, J.; Ban, Y. Mamba-UNet: Dual-Branch Mamba Fusion U-Net With Multiscale Spatio-Temporal Attention for Precipitation Nowcasting. IEEE Trans. Ind. Inform. 2025, 21, 4466–4475.
- Zhang, J.; Chen, M.; Zhao, Y.; Shan, L.; Li, C.; Hu, H.; Ge, X.; Zhu, Q.; Xu, B. Asymmetric Mamba-CNN Collaborative Architecture for Large-Size Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 2002419.
- Liu, Z.; Chen, H.; Bai, L.; Li, W.; Ouyang, W.; Zou, Z.; Shi, Z. Mambads: Near-surface meteorological field downscaling with topography constrained selective state space modeling. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4112615.
- Ma, X.; Lv, Z.; Ma, C.; Zhang, T.; Xin, Y.; Zhan, K. BS-Mamba for black-soil area detection on the Qinghai-Tibetan plateau. J. Appl. Remote Sens. 2025, 19, 028502.
- Liu, Y.; Shi, H.; Cao, K.; Wu, S.; Ye, H.; Wang, X.; Sun, E.; Han, Y.; Xiong, W. kMetha-Mamba: K-means clustering mamba for methane plumes segmentation. Int. J. Appl. Earth Obs. Geoinf. 2025, 142, 104664.
- Yu, W.; Wang, X. Mambaout: Do we really need mamba for vision? In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 4484–4496.
- Xiao, C.; Li, M.; Zhang, Z.; Meng, D.; Zhang, L. Spatial-mamba: Effective visual state space models via structure-aware state fusion. arXiv 2024, arXiv:2410.15091.
- Hamdan, E.; Pan, H.; Cetin, A.E. Sparse Mamba: Introducing Controllability, Observability, And Stability To Structural State Space Models. arXiv 2024, arXiv:2409.00563.
- Shi, Y.; Li, M.; Dong, M.; Xu, C. Vssd: Vision mamba with non-causal state space duality. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–25 October 2025; pp. 10819–10829.
- Ding, H.; Xia, B.; Liu, W.; Zhang, Z.; Zhang, J.; Wang, X.; Xu, S. A novel mamba architecture with a semantic transformer for efficient real-time remote sensing semantic segmentation. Remote Sens. 2024, 16, 2620.
- Díaz, A.H.; Davidson, R.; Eckersley, S.; Bridges, C.P.; Hadfield, S.J. E-mamba: Using state-space-models for direct event processing in space situational awareness. In Proceedings of the SPAICE 2024: The First Joint European Space Agency/IAA Conference on AI in and for Space, Harwell, UK, 17–19 September 2024; pp. 509–514.
- Sedeh, M.A.; Sharifian, S. EdgePVM: A serverless satellite edge computing constellation for changes detection using onboard Parallel siamese Vision MAMBA. Future Gener. Comput. Syst. 2025, 174, 107985.
- Jiang, F.; Pan, C.; Dong, L.; Wang, K.; Debbah, M.; Niyato, D.; Han, Z. A comprehensive survey of large ai models for future communications: Foundations, applications and challenges. arXiv 2025, arXiv:2505.03556.
- Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707.
- Hu, Z.; Daryakenari, N.A.; Shen, Q.; Kawaguchi, K.; Karniadakis, G.E. State-space models are accurate and efficient neural operators for dynamical systems. arXiv 2024, arXiv:2409.03231.
- Cheng, C.W.; Huang, J.; Zhang, Y.; Yang, G.; Schönlieb, C.B.; Aviles-Rivero, A.I. Mamba neural operator: Who wins? transformers vs. state-space models for pdes. arXiv 2024, arXiv:2410.02113.
- Liu, C.; Zhao, B.; Ding, J.; Wang, H.; Li, Y. Mamba Integrated with Physics Principles Masters Long-term Chaotic System Forecasting. arXiv 2025, arXiv:2505.23863.
- Li, S.; Singh, H.; Grover, A. Mamba-nd: Selective state space modeling for multi-dimensional data. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 75–92.
- Qin, H.; Chen, Y.; Jiang, Q.; Sun, P.; Ye, X.; Lin, C. Metmamba: Regional weather forecasting with spatial-temporal mamba model. arXiv 2024, arXiv:2408.06400.
- Eddin, M.H.S.; Zhang, Y.; Kollet, S.; Gall, J. RiverMamba: A State Space Model for Global River Discharge and Flood Forecasting. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, San Diego, CA, USA, 2–7 December 2025.
- Rasp, S.; Pritchard, M.S.; Gentine, P. Deep learning to represent subgrid processes in climate models. Proc. Natl. Acad. Sci. USA 2018, 115, 9684–9689.
- Yuval, J.; O’Gorman, P.A. Stable machine-learning parameterization of subgrid processes for climate modeling at a range of resolutions. Nat. Commun. 2020, 11, 3295.
- Kochkov, D.; Yuval, J.; Langmore, I.; Norgaard, P.; Smith, J.; Mooers, G.; Klöwer, M.; Lottes, J.; Rasp, S.; Düben, P.; et al. Neural general circulation models for weather and climate. Nature 2024, 632, 1060–1066.
- Bock, F.E.; Keller, S.; Huber, N.; Klusemann, B. Hybrid modelling by machine learning corrections of analytical model predictions towards high-fidelity simulation solutions. Materials 2021, 14, 1883.
- Beucler, T.; Koch, E.; Kotlarski, S.; Leutwyler, D.; Michel, A.; Koh, J. Next-generation earth system models: Towards reliable hybrid models for weather and climate applications. arXiv 2023, arXiv:2311.13691.
- Huo, C.; Chen, K.; Zhang, S.; Wang, Z.; Yan, H.; Shen, J.; Hong, Y.; Qi, G.; Fang, H.; Wang, Z. When Remote Sensing Meets Foundation Model: A Survey and Beyond. Remote Sens. 2025, 17, 179.
- Xiao, A.; Xuan, W.; Wang, J.; Huang, J.; Tao, D.; Lu, S.; Yokoya, N. Foundation models for remote sensing and earth observation: A survey. IEEE Geosci. Remote Sens. Mag. 2025, 13, 297–324.
- Cong, Y.; Khanna, S.; Meng, C.; Liu, P.; Rozi, E.; He, Y.; Burke, M.; Lobell, D.; Ermon, S. Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery. Adv. Neural Inf. Process. Syst. 2022, 35, 197–211.
- Reed, C.J.; Gupta, R.; Li, S.; Brockman, S.; Funk, C.; Clipp, B.; Keutzer, K.; Candido, S.; Uyttendaele, M.; Darrell, T. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4088–4099.
- Mendieta, M.; Han, B.; Shi, X.; Zhu, Y.; Chen, C.; Li, M. GFM: Building geospatial foundation models via continual pretraining. arXiv 2023, arXiv:2302.04476.
- Shi, Z.; Zhao, C.; Wang, K.; Kong, X.; Zhu, J. Geo-Mamba: A data-driven Mamba framework for spatiotemporal modeling with multi-source geographic factor integration. Int. J. Appl. Earth Obs. Geoinf. 2025, 144, 104854.
- Gong, S.; Zhuge, Y.; Zhang, L.; Wang, Y.; Zhang, P.; Wang, L.; Lu, H. Avs-mamba: Exploring temporal and multi-modal mamba for audio-visual segmentation. IEEE Trans. Multimed. 2025, 27, 5413–5425.
- Zhou, G.; Qian, L.; Gamba, P. Advances on multimodal remote sensing foundation models for Earth observation downstream tasks: A survey. Remote Sens. 2025, 17, 3532.
- Dao, T.; Fu, D.; Ermon, S.; Rudra, A.; Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Adv. Neural Inf. Process. Syst. 2022, 35, 16344–16359.
- Choquette, J. Nvidia hopper h100 gpu: Scaling performance. IEEE Micro 2023, 43, 9–17.
- Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv 2023, arXiv:2307.08691.
- Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. Mixed precision training. arXiv 2017, arXiv:1710.03740.
- Yu, A.; Erichson, N.B. Block-Biased Mamba for Long-Range Sequence Processing. arXiv 2025, arXiv:2505.09022.
- Zhang, J.; Nguyen, A.T.; Han, X.; Trinh, V.Q.H.; Qin, H.; Samaras, D.; Hosseini, M.S. 2DMamba: Efficient state space model for image representation with applications on giga-pixel whole slide image classification. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 3583–3592.
- Strubell, E.; Ganesh, A.; McCallum, A. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 3645–3650.
- Lacoste, A.; Luccioni, A.; Schmidt, V.; Dandres, T. Quantifying the carbon emissions of machine learning. arXiv 2019, arXiv:1910.09700.
- Lannelongue, L.; Grealey, J.; Inouye, M. Green algorithms: Quantifying the carbon footprint of computation. Adv. Sci. 2021, 8, 2100707.
- Bouza, L.; Bugeau, A.; Lannelongue, L. How to estimate carbon footprint when training deep learning models? A guide and review. Environ. Res. Commun. 2023, 5, 115014.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.