1. Introduction
With the rapid development of image sensor technology and the aerospace industry, ultra-high resolution (UHR) remote sensing images are being produced in massive volumes, placing significant pressure on the analysis of remote sensing data. Semantic segmentation is a prerequisite for remote sensing image interpretation; the quality of the segmentation directly determines how well subsequent high-level interpretation and applications can be carried out. As a key technology in remote sensing image research, semantic segmentation is therefore widely used in many active research areas, including land cover mapping [1,2], wildfire detection [3], change detection [4,5], object extraction [6,7], and disaster management [8,9], and has become a research topic of considerable significance and practical value.
In recent years, with the rapid development of deep learning techniques, the semantic segmentation of ultra-high-resolution (UHR) remote sensing images has made significant progress through the introduction of convolutional neural networks (CNNs) [10,11,12]. However, despite their excellent performance in many aspects, these methods remain constrained by the intrinsic properties of convolutional kernels, whose limited receptive field restricts them to local feature extraction. This limitation often results in the loss of fine details, making it difficult to recognize and segment small-scale objects with large variations in size. To overcome this issue, the popular Transformer [13] architecture has begun to attract attention from researchers in remote sensing image semantic segmentation. Leveraging its powerful backbone, Transformer-based semantic segmentation methods for remote sensing images [14,15,16] have achieved even more outstanding performance. An in-depth analysis of how data modalities and structures affect semantic segmentation shows that single-modality remote sensing data can only provide information from a single perspective, limiting further improvements in feature extraction [17]. With the continuous advancement of Earth observation technologies, multi-source remote sensing data, such as optical images, digital surface models (DSM), and multispectral and hyperspectral images, can perceive the same scene from different dimensions, offering more comprehensive and complementary information. This effectively breaks through the cognitive bottleneck of traditional single-modality imaging mechanisms [18]. Therefore, fusing multi-modal remote sensing data and exploring more efficient, precise, and robust semantic segmentation methods have become a key research direction, providing new opportunities to enhance the performance of intelligent remote sensing interpretation tasks [19,20,21].
In the task of multimodal UHR remote sensing image semantic segmentation, recent years have witnessed a surge of studies leveraging deep learning models for feature extraction and fusion across multi-source remote sensing data. Most existing approaches design specific network architectures to extract modality-specific features or generate attention weight maps, followed by simple fusion strategies such as weighted summation or channel concatenation to integrate multimodal features [22]. Although these methods have led to performance improvements to a certain extent, their fusion mechanisms remain relatively coarse and often fail to adequately capture the inherent structural heterogeneity across different modalities, which presents substantial limitations when applied to remote sensing imagery [23]. As illustrated in Figure 1, compared with natural images, remote sensing scenes exhibit significant intra-class variations in terms of scale, shape, and texture. Moreover, object boundaries tend to be ambiguous, and class confusion is widespread, especially for small objects whose fine details are often neglected [24]. These characteristics impose stricter requirements on fusion modules in terms of discriminative capability and fine-grained information preservation [25]. However, due to the insufficient ability of existing fusion strategies to model cross-modal feature interactions, most current methods tend to lose crucial details when handling complex scenes, making it difficult to achieve effective joint modeling of both local and global semantics [26]. As a critical spatial structural cue in multimodal remote sensing imagery, boundary features play a vital role in improving cross-modal alignment precision and guiding segmentation networks to focus on object edge regions [27]. They not only enhance feature discriminability during the fusion process but also help mitigate semantic ambiguity caused by blurred boundaries and occlusions, thereby preserving essential spatial details [28]. Nevertheless, most existing boundary modeling approaches still rely on CNN-based structures, which are constrained by the limited receptive field of convolutional kernels [29]. This limitation hampers their ability to accurately capture fine-grained boundary contours [30]. Recent studies have demonstrated that employing large-kernel convolutions can effectively expand the receptive field and improve the representation of edge regions [31]. However, in remote sensing imagery, overly large convolution kernels may introduce redundant background noise, thereby degrading target recognition and segmentation accuracy.
While CNNs and Transformers have been widely applied in fusion-based remote sensing image semantic segmentation tasks, convolutional neural networks are hindered by their limited receptive fields, which restrict their ability to capture global context and cause them to overlook long-range cross-modal dependencies at different feature levels [33]. In contrast, Transformers excel at extracting global features by capturing long-distance spatial dependencies, thereby accurately representing global semantic information, but their computational costs are prohibitively high [34]. Recently, the novel Mamba architecture [35], built upon state-space models (SSM) [36], has garnered significant attention due to its remarkable performance in modeling long-distance interactions while maintaining linear computational complexity. However, unlike sequential data (e.g., text), multimodal remote sensing data lack the causal relationships inherent in sequence data. When processing multimodal remote sensing image inputs, the scan operation originally designed for text sequences may fail to capture key features of the multimodal information while generating one-dimensional sequence data. Consequently, directly applying the Mamba framework [10,37] for fusion modeling in remote sensing segmentation may inevitably lead to performance degradation. Therefore, exploring how to customize the Mamba framework for multimodal fusion in the field of remote sensing image segmentation is both crucial and meaningful.
1.1. Related Works
Driven by advances in Earth observation technology, multimodal data (e.g., optical imagery, DSM, and multispectral data) have gradually received widespread attention. Multimodal fusion provides more comprehensive and richer surface feature information, thereby significantly enhancing model performance. For instance, Hazirbas et al. [20] and Audebert et al. [21] proposed models based on dual-branch structures, which process different modality features in parallel. While these methods improved fusion performance, their simple element-wise addition strategy still has limitations. Subsequently, Seichter et al. [38] stacked RGB and DSM data into four-channel inputs. However, this straightforward channel-stacking approach failed to fully exploit the complementary nature of multimodal features. To further enhance feature fusion, Hosseinpour et al. [39] and Chen et al. [40] introduced gated fusion and cross-layer gating mechanisms to better accommodate complex multimodal data scenarios. With the introduction of Transformer technology, Transformer-based multimodal fusion methods have made remarkable progress in remote sensing image segmentation tasks. He et al. [23] enhanced the self-attention module by incorporating spatial and channel attention mechanisms, achieving more efficient feature fusion. Meanwhile, Ma et al. [32] improved skip connections through cross-modal and multi-scale fusion using Transformers, enabling a robust representation of surface objects with significant scale variations. However, existing CNN- and Transformer-based architectures remain deficient in modeling fine-grained local and global contextual information, and they have mostly been developed for single-modal or simple multimodal scenarios, failing to fully explore the value of multimodal data within the Mamba framework for remote sensing semantic segmentation.
In recent years, with the increasing challenges posed by complex scenes and multi-scale objects in remote sensing images, researchers have begun exploring how large receptive field convolutions can enhance the performance of remote sensing semantic segmentation, and various designs for large receptive field convolutional networks have been proposed. Liu et al. [41] adopted a 7 × 7 depthwise convolution, significantly improving performance in general visual tasks. Li et al. [30] proposed a method for selectively expanding the spatial receptive field for large objects. Ding et al. [33] employed 31 × 31 large convolutional kernels, demonstrating the potential of convolutional networks in capturing long-range dependencies. Additionally, Liu et al. [42] extended the kernel size to 51 × 51 through kernel decomposition and sparse grouping, enlarging the receptive field without increasing computational complexity. Meanwhile, Cai et al. [31] leveraged parallel arrangements of depth-wise convolutional kernels of various sizes to extract dense texture features across multi-scale receptive fields, enabling convolution operations to capture rich local contextual information under a larger receptive field. These works collectively demonstrate that well-designed large receptive field convolutions can significantly enhance the extraction of local contextual details.
Although large kernel convolutions have demonstrated impressive performance in general vision tasks, their multi-branch parallel structures often introduce substantial computational overhead. In remote sensing imagery, where object boundaries are highly complex and vary significantly in scale, efficiently leveraging large-kernel convolutions to extract fine-grained boundary features without introducing excessive computational redundancy or noise remains a significant challenge. To address these issues, we introduce large strip convolutions, which employ elongated, directionally constrained convolutional kernels in the spatial domain. This design allows for more efficient extraction of directional boundary details while mitigating the redundant receptive fields and computational costs associated with global kernel expansion.
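To illustrate the idea of directionally constrained kernels, the minimal PyTorch sketch below builds a pair of depthwise strip convolutions with kernel sizes (1, k) and (k, 1); the channel count, kernel length, and the way the two responses are merged are placeholder choices of ours, not the paper's implementation:

```python
import torch
import torch.nn as nn

class StripConv(nn.Module):
    """Directional strip convolutions: a 1 x k horizontal kernel and a
    k x 1 vertical kernel, applied depthwise so the cost grows with k,
    not with k * k as it would for a square k x k kernel."""
    def __init__(self, channels: int, k: int = 11):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, kernel_size=(1, k),
                                    padding=(0, k // 2), groups=channels)
        self.vertical = nn.Conv2d(channels, channels, kernel_size=(k, 1),
                                  padding=(k // 2, 0), groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sum of the two directional responses; a channel split with one
        # direction per group (as in the LSSD module later) also works.
        return self.horizontal(x) + self.vertical(x)

x = torch.randn(1, 64, 128, 128)        # toy feature map
print(StripConv(64, k=11)(x).shape)     # torch.Size([1, 64, 128, 128])
```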
State space models (SSMs) [35,43] have emerged as a practical component for constructing deep networks due to their exceptional performance in analyzing continuous long-sequence data. In recent years, various SSM variants have been introduced to accommodate different application requirements. Structured SSMs [44], which incorporate diagonal structures combined with low-rank methods, have improved modeling efficiency, while the integration of parallel scanning techniques has further enhanced the performance of SSMs in large-scale data processing. Building upon this foundation, the Mamba architecture was developed, refining the linear time-invariant formulation of SSMs by incorporating data-dependent parameters and demonstrating superior performance compared to Transformers on large-scale datasets. Inspired by SSMs, Liu et al. [45] and Smith et al. [46] successfully introduced Mamba into image classification tasks. Ma et al. [47] were the first to design a visual Mamba-assisted branch based on the VSS module for remote sensing image semantic segmentation. Chen et al. [48] proposed a random smoothing method to explore non-traditional spatial connectivity approaches, while He et al. [49] were the pioneers in applying Mamba to pan-sharpening tasks.
However, the aforementioned methods lack customization when handling multimodal remote sensing data. Clearly, directly applying one-dimensional traversal strategies designed for text sequence data to remote sensing image data inevitably leads to performance degradation, as the relationships between pixels in remote sensing images are vastly different from the relationships between elements in sequence data like text. We have redesigned the Mamba framework, proposing a new scanning method tailored for multimodal remote sensing data, which enhances the efficiency and accuracy of cross-modal information learning, thereby enabling Mamba to capture a broader range of deep global contextual information.
1.2. Contribution
Remote sensing semantic segmentation faces challenges such as insufficient alignment of shallow boundary features and limited modeling of long-range contextual information in deep multimodal fusion, which hinder precise boundary delineation and global semantic consistency. To address these issues, we propose FMLSNet, a boundary-enhanced multilevel multimodal fusion architecture that combines multi-directional large strip convolutions for fine-grained boundary extraction with a multistage Mamba-based fusion framework for deep heterogeneous feature alignment. This hybrid design effectively bridges local detail preservation and global contextual understanding, achieving superior performance on complex multimodal remote sensing scenes. The main contributions of this work are summarized as follows:
- (1)
We propose a novel FMLSNet model for remote sensing image semantic segmentation. By integrating the LSSD module with the FMB module, our method enhances the recognition of fine-grained objects at multiple scales and improves the capture of deep global contextual features, thereby boosting overall segmentation accuracy. Experimental results on several challenging benchmark datasets demonstrate that our model outperforms state-of-the-art methods.
- (2)
We propose a multistage Mamba-based multimodal fusion Block (FMB) with a redesigned cross-modal scanning mechanism and disentangled representation learning. This enhances the capture of deep global context, resulting in improved multimodal feature interaction and representation.
- (3)
We design an LSSD module that employs multiple orientation-specific strip kernels with progressive receptive field growth, enabling a precise extraction of boundary and fine-grained spatial features. This design balances local detail preservation and global structural awareness, which is crucial for complex object boundaries in UHR remote sensing imagery.
The rest of this paper is organized as follows. In Section 2, the proposed model is described in detail. Section 3 presents the experimental results, including comparisons with other methods, demonstrating the effectiveness of our proposed model. Section 4 provides an in-depth discussion of the performance–efficiency trade-off and outlines potential directions for future optimization. Finally, Section 5 concludes the paper with a summary of the contributions.
Figure 2. The overall framework of our proposed FMLSNet.
2. Method
2.1. Overview of FMLSNet
As illustrated in Figure 2, the proposed FMLSNet model is composed of three main components: the Large Strip Spatial Detail (LSSD) module for fine-grained boundary feature extraction based on hybrid large strip convolutions, the Fusion Mamba Block (FMB) for multi-stage multimodal deep feature integration, and a multi-task learning-based cascaded decoder. Specifically, we employ two ResNet [50] branches with pretrained weights to extract features from the visible spectrum image (VSI) and the digital surface model (DSM), respectively, serving as the encoder. In the LSSD module, we apply large strip convolutions in multiple directions to capture edge features along different orientations. These rich directional boundary features are then used to guide the deep learning of shallow features with varied receptive fields from multiple modalities, compensating for the loss of high-level spatial detail and enabling the model to learn more fine-grained multimodal spatial representations. In the FMB module, we fuse deep multimodal remote sensing features through a novel multi-stage Mamba-based fusion strategy to capture more comprehensive global contextual information. In the first stage, we integrate disentangled representation learning into the classical Mamba structure to align the shared feature distributions generated from different modalities more efficiently, enhancing cross-modal information fusion. In the second stage, a novel scanning mechanism is adopted to perform cross-modal scanning, enabling the model to effectively capture modality-specific associations and complementary information, thereby improving fusion precision and semantic consistency. In the third stage, a bidirectional scanning strategy is introduced to achieve a comprehensive integration of deep features, strengthening the model's representation ability in complex remote sensing scenarios. Finally, the multi-task learning-based cascaded decoder integrates the fine-grained shallow spatial features and the deep global contextual features produced by the preceding modules to generate more accurate segmentation results.
2.2. Large Strip Spatial Detail Module
In remote sensing image semantic segmentation tasks, significant spatial and spectral differences between data from different sources often degrade the segmentation performance of multimodal fusion networks, as such inter-modal inconsistencies introduce gap-interference problems. Previous works [26,28] have demonstrated that introducing boundary information plays a critical role in the calibration process during feature fusion, effectively mitigating the negative effects caused by modal differences and improving the boundary clarity and accuracy of the segmentation results. To address this, we propose an innovative hybrid large strip method. This method employs specially designed large strip convolutions to precisely capture boundary information in different directions, significantly alleviating spatial separation issues. Meanwhile, by employing multiple parallel large strip convolutions of varying sizes, the module enhances the ability to capture fine details under diverse receptive fields, achieving pixel-level precision in critical regions. Moreover, the extracted multidirectional boundary features further guide the deep learning of multimodal shallow features under different receptive fields, compensating for the lack of detailed information in high-level features and thus capturing and fusing multimodal shallow feature information more comprehensively.
We take the VSI and the corresponding DSM data as the two inputs, whose height and width are denoted by H and W. Specifically, each CNN encoder consists of four ResBlocks, which are responsible for extracting shallow detail features. The feature map generated by each ResBlock is progressively downsampled, where i represents the index of the ResBlock layer. The shallow features extracted by each layer are processed by the LSSD module. Before the fused features are input into the next VSI encoder stage, the features from the auxiliary modality are integrated as input for the primary modality (i.e., VSI).
As shown in Figure 3, the LSSD module consists of two modality branches. Unlike standard convolution operations, we apply a channel-wise splitting operation to the input feature map. Specifically, the input feature map is evenly divided into four parts along the channel dimension. Asymmetric padding is then used to construct horizontal and vertical convolution kernels for different spatial regions of the image. Each part is subjected to a large strip convolution in a specific direction. Since both branches follow the same structure, we take the VSI modality at layer i as an example for the mathematical formulation, as follows:
where the four resulting terms denote the features extracted from four different directions in the VSI modality branch using depthwise separable convolutions (DWConv), and the padding parameters represent the number of pixels padded in the left, right, top, and bottom directions, respectively. We define a linear growth rate for the strip kernel length (where i denotes the layer index of the ResBlock), dynamically adjusting the convolutional kernel size according to the depth of the current stage i within the LSSD module. This mechanism progressively enlarges the receptive field of deeper network layers, achieving a balanced trade-off between feature richness and computational cost. The concatenated and interleaved convolution results are then computed as follows:
Here, the resulting tensors represent the feature maps obtained by concatenation for the two modalities, respectively, where Concat denotes the channel-wise concatenation operation. Our LSSD module is capable of capturing precise detail information and enhancing the representational capacity of features without compromising the integrity of local spatial structures. Subsequently, a 2 × 2 convolution without padding is first applied to normalize the concatenated tensor. This is followed by a global average pooling (AvgPool) operation to aggregate global information, compress the channel dimension, and extract localized spatial features.
where GAP denotes global average pooling. Subsequently, we introduce multiple convolutional modules inspired by channel-aware mechanisms to perform fine-grained learning on the pooled features of the two branches.
where the two convolutional layers map the features to their respective output dimensions, and ReLU represents the ReLU activation function. Finally, for the calibration of the parallel dual branches, the correction formula for the original feature map is given as follows:
Here, S denotes the Sigmoid activation function, and pixel-wise multiplication is applied to recalibrate the original feature maps. Through the carefully designed multi-directional large strip convolutions, the model captures both rich local spatial details and global boundary features. This enables better encoding of boundary constraints and local spatial contextual information, resulting in robust and precise shallow spatial detail features.
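The following PyTorch sketch illustrates the core LSSD computation described above for a single modality branch: a four-way channel split, directional strip convolutions with asymmetric padding, concatenation, and a GAP-driven Sigmoid recalibration. It is our own simplified reading of the text (the kernel length k, the reduction ratio, and the layer names are assumptions, and the 2 × 2 normalization convolution and cross-modal guidance are omitted), not the authors' code:

```python
import torch
import torch.nn as nn

class LSSDBranchSketch(nn.Module):
    """Single-branch sketch of the LSSD idea: split channels into four
    groups, apply a large strip convolution in one direction per group,
    then recalibrate the input with a GAP-driven Sigmoid gate."""
    def __init__(self, channels: int, k: int = 11, reduction: int = 4):
        super().__init__()
        c = channels // 4
        pad = k // 2
        # One directional depthwise strip convolution per channel group;
        # the left/right/top/bottom emphasis comes from asymmetric padding.
        self.strips = nn.ModuleList([
            nn.Conv2d(c, c, (1, k), groups=c),  # horizontal (left-padded input)
            nn.Conv2d(c, c, (1, k), groups=c),  # horizontal (right-padded input)
            nn.Conv2d(c, c, (k, 1), groups=c),  # vertical (top-padded input)
            nn.Conv2d(c, c, (k, 1), groups=c),  # vertical (bottom-padded input)
        ])
        self.pads = [(2 * pad, 0, 0, 0), (0, 2 * pad, 0, 0),
                     (0, 0, 2 * pad, 0), (0, 0, 0, 2 * pad)]  # (left, right, top, bottom)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                          # GAP
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        parts = torch.chunk(x, 4, dim=1)
        outs = [conv(nn.functional.pad(p, pad))
                for conv, pad, p in zip(self.strips, self.pads, parts)]
        boundary = torch.cat(outs, dim=1)                     # directional boundary features
        return x * self.gate(boundary)                        # recalibrate the original features

x = torch.randn(1, 64, 64, 64)
print(LSSDBranchSketch(64, k=11)(x).shape)                    # torch.Size([1, 64, 64, 64])
```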
2.3. Mamba-Based Deep Feature Fusion
VSI and DSM play critical roles in UHR semantic segmentation. While VSI features provide rich semantic cues, DSM features offer more prominent object layout information. However, Transformer-based methods suffer from quadratic complexity and often require dimensionality reduction for global attention, which can lead to the loss of modality and context information. Unlike prior Mamba-based approaches [37,47] that operate on 1D sequences and ignore spatial dependencies, we introduce a cross-modal 2D scanning mechanism combined with disentangled representation learning to enhance spatial alignment and reduce modality interference, making it well-suited for high-resolution multimodal data. Based on this, we design FMB with three progressive stages. In the first stage, SDMamba, disentangled representation learning is integrated into the Mamba [51] structure to effectively align shared feature distributions across modalities, thereby improving cross-modal information interaction. In the second stage, MBMamba leverages a cross-modal scanning mechanism to deeply model inter-modal correlations and reveal latent complementarities between multimodal features, achieving finer semantic integration. In the final stage, BFMamba employs a bidirectional scanning strategy within the Mamba framework to fully fuse deep features and capture global contextual information, ultimately providing more robust feature representations for complex scene modeling.
Selective Disentangled Mamba (SDMamba): The features produced by the last ResBlock of the dual-branch ResNet encoder serve as the inputs to this stage, where n and C denote the layer index of the final ResBlock and its output channel size, respectively, and the VSI and DSM branches each provide one feature map at the n-th layer. The entire fusion process of the SDMamba block can be represented as follows.
Specifically, the SDMamba block consists of two identical SDMamba branches, as shown in the upper part of Figure 4. To achieve the disentanglement of different modalities, the input features are first processed through a series of linear projections (Linear) and depth-wise separable convolutions (DWConv) to generate cross-constructed images containing modality-specific attribute information. Subsequently, the Selective Scan 2D (SS2D) part integrates these cross-constructed images with attribute-related features, producing cross-reconstructed images. Moreover, residual connections are employed to preserve feature consistency and integrity. In the SS2D part, as depicted in the lower part of Figure 4, the input features are reshaped along four scan paths: top-left to bottom-right, bottom-right to top-left, top-right to bottom-left, and bottom-left to top-right, forming a total of four sequences. Next, four different selective scanning modules [44] are used to extract multi-directional information. Finally, the four sequences are reversed back to the original orientation and summed.
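A minimal sketch of this four-direction scan is given below. The per-sequence selective scan itself is replaced by a placeholder (a cumulative sum), and the two column-wise paths only approximate the remaining scan directions; the sketch is illustrative rather than a reproduction of the SS2D kernel:

```python
import torch

def ss2d_four_way(x: torch.Tensor, scan=lambda s: torch.cumsum(s, dim=-1)):
    """Four-direction 2D scan (sketch). x: (B, C, H, W).
    The feature map is unrolled along four scan paths, each 1D sequence is
    processed by `scan` (placeholder here), and the results are flipped
    back to image order and summed."""
    B, C, H, W = x.shape
    rows = x.flatten(2)                                # row-major: TL -> BR
    cols = x.transpose(2, 3).flatten(2)                # column-major paths (approximation)
    seqs = [rows, rows.flip(-1), cols, cols.flip(-1)]  # the four scan directions
    outs = [scan(s) for s in seqs]
    # Undo the flips / transposes so every result is back in (B, C, H, W) order.
    y = outs[0].view(B, C, H, W)
    y = y + outs[1].flip(-1).view(B, C, H, W)
    y = y + outs[2].view(B, C, W, H).transpose(2, 3)
    y = y + outs[3].flip(-1).view(B, C, W, H).transpose(2, 3)
    return y

print(ss2d_four_way(torch.randn(2, 8, 16, 16)).shape)  # torch.Size([2, 8, 16, 16])
```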
The disentangled learning in SDMamba helps separate modality-specific and shared features by projecting cross-reconstructed representations. This separation enhances alignment in common semantic spaces and reduces modality interference during fusion.
Mutually Boosted Mamba (MBMamba): After deep feature enhancement in SDMamba, the FMB further utilizes MBMamba blocks to extract and integrate multimodal features and contextual information from the semantic space, as shown in Figure 4. The overall fusion process of the MBMamba block can be mathematically represented as follows:

As illustrated in the upper part of Figure 5, in MBMamba the system matrices B, C, and Δ are generated from the inputs to realize the context-aware capability of the model, and we utilize a linear projection layer to generate these matrices. Inspired by the cross-attention mechanism [37] widely used in multimodal tasks, we propose using the C matrix generated from the complementary modality in the selective scanning operation, which enables the SSM to reconstruct outputs from hidden states guided by the other modality and facilitates the exchange of information between the multiple selective scanning mechanisms. Specifically, the interaction linkage mechanism can be formulated as follows:
where x_t^VSI and x_t^DSM represent the time-step t inputs of the VSI and DSM modalities, respectively, y_t^VSI and y_t^DSM denote the outputs of the selective scanning operation, and the cross-modal C matrices are used to recover the outputs at each time-step from the hidden states.
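To make the cross-modal linkage concrete, the toy recurrence below swaps the C matrices of the two modalities inside a simplified, dense (non-selective) SSM scan; the tensor shapes and the shared A and B matrices are placeholder choices of ours, not the paper's exact formulation:

```python
import torch

def cross_modal_scan(x_vsi, x_dsm, A, B, C_vsi, C_dsm):
    """Toy sequential SSM scan with swapped C matrices (sketch).
    x_*: (L, D) input sequences, A: (N, N), B: (N, D), C_*: (D, N).
    Each modality's output is reconstructed from its own hidden state
    using the C matrix produced by the other modality."""
    N = A.shape[0]
    h_v = torch.zeros(N)
    h_d = torch.zeros(N)
    y_v, y_d = [], []
    for t in range(x_vsi.shape[0]):
        h_v = A @ h_v + B @ x_vsi[t]      # VSI hidden-state update
        h_d = A @ h_d + B @ x_dsm[t]      # DSM hidden-state update
        y_v.append(C_dsm @ h_v)           # VSI output guided by DSM's C matrix
        y_d.append(C_vsi @ h_d)           # DSM output guided by VSI's C matrix
    return torch.stack(y_v), torch.stack(y_d)

L, D, N = 16, 8, 4
out_v, out_d = cross_modal_scan(torch.randn(L, D), torch.randn(L, D),
                                0.9 * torch.eye(N), torch.randn(N, D),
                                torch.randn(D, N), torch.randn(D, N))
print(out_v.shape, out_d.shape)           # torch.Size([16, 8]) torch.Size([16, 8])
```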
The fused feature maps are further enhanced through the SDMamba blocks, yielding the final outputs of this stage. Finally, the enhanced feature maps are processed through BFMamba's bi-directional scanning mechanism and feature fusion strategy, enabling seamless integration of the outputs from the SDMamba and MBMamba stages within the BFMamba stage. This process facilitates efficient global contextual information modeling. The entire fusion process of the BFMamba block can be expressed as follows.
Specifically, as shown in the bottom part of Figure 5, the final outputs from the SDMamba stage are first processed through a linear layer and a depth-wise convolutional layer, yielding two features, which are then unfolded and concatenated along the sequence dimension to form the forward sequence. Additionally, to comprehensively capture multimodal information, we perform reverse scanning to construct the reverse sequence, obtaining an augmented sequence. Each sequence is processed individually to capture long-term dependencies, followed by flipping the processed sequences and adding them back to the original sequences. The corresponding process can be expressed as follows:
Here, Linear denotes the linear projection layer, the forward and reverse sequences are constructed as described above, and SSM stands for Spatial-Sequence Modeling.
Finally, the features are reshaped back to their original spatial dimensions using a linear projection layer. Within the proposed FMB framework, the rich contextual information across modalities is deeply fused before being input into the semantic-level joint decoder, achieving more refined feature integration and higher decoding efficiency.
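The bidirectional fusion step can be sketched as follows; the sequence model is a placeholder for the Mamba/SSM block and the tensor layout is assumed:

```python
import torch
import torch.nn as nn

def bfmamba_sketch(f_sd: torch.Tensor, f_mb: torch.Tensor,
                   seq_model: nn.Module) -> torch.Tensor:
    """Bidirectional fusion sketch. f_sd, f_mb: (B, C, H, W) outputs of the
    earlier stages. They are unfolded to sequences, concatenated, scanned
    forward and backward, and the backward result is flipped and added back."""
    B, C, H, W = f_sd.shape
    seq = torch.cat([f_sd.flatten(2), f_mb.flatten(2)], dim=-1)   # (B, C, 2*H*W)
    fwd = seq.transpose(1, 2)                                     # (B, L, C) forward order
    bwd = fwd.flip(1)                                             # reverse scan order
    out = seq_model(fwd) + seq_model(bwd).flip(1)                 # bidirectional aggregation
    fused = out[:, :H * W] + out[:, H * W:]                       # merge the two modal halves
    return fused.transpose(1, 2).view(B, C, H, W)

# Placeholder sequence model standing in for the Mamba/SSM block.
seq_model = nn.Sequential(nn.Linear(32, 32), nn.GELU(), nn.Linear(32, 32))
y = bfmamba_sketch(torch.randn(2, 32, 8, 8), torch.randn(2, 32, 8, 8), seq_model)
print(y.shape)   # torch.Size([2, 32, 8, 8])
```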
2.4. Multitask Learning-Based Cascaded Decoder
Shallow spatial detail features and deep global contextual features are complementary to each other. Shallow spatial detail features lack rich semantic information, but they retain finer details, clearer boundaries, and less distortion; deep global contextual features, in contrast, carry rich semantic information. Simply fusing the two kinds of features directly may therefore introduce redundant information or lead to inconsistency, affecting the final segmentation results. To address this issue, we adopt a multitask learning strategy during training that jointly addresses semantic segmentation and boundary detection. Accordingly, separate supervised loss functions are designed for each task to guide the learning process.
As shown in Figure 2, multiple cascaded decoder networks connect to the corresponding layers of the ResNet backbone via skip connections, progressively restoring the spatial resolution to H × W. Each decoder network consists of an upsampling layer, a convolutional layer, and a ReLU activation layer. The loss function for the semantic segmentation task is as follows:
where the predicted segmentation result is compared against the ground truth, the summation runs over the number of classes, and the subscripts i, j correspond to the pixel located at the i-th row and j-th column in the n-th segmentation result image.
In the boundary detection task, we utilize fine-grained shallow features to generate boundary feature maps, thereby precisely capturing boundary regions in the image. This strategy allows the network to effectively preserve local details when handling complex object boundaries, resulting in higher-quality boundary information and better segmentation performance in the final results. The loss function for boundary detection is as follows:
where p represents the predicted boundary detection map, q denotes the ground truth for boundary detection, i, j indicate the pixel at the i-th row and j-th column in the boundary detection results, and M × N represents the size of the input image.
The hyperparameters λ1 and λ2 are used to balance the effects of the two loss functions. Through ablation experiments, both λ1 and λ2 are ultimately set to 0.5.
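A minimal sketch of this weighted multitask objective is shown below. We assume a standard per-pixel cross-entropy for segmentation and a binary cross-entropy for the boundary map; these specific loss forms are our assumption, standing in for the paper's own equations:

```python
import torch
import torch.nn.functional as F

def multitask_loss(seg_logits, seg_target, boundary_logits, boundary_target,
                   lambda_seg: float = 0.5, lambda_bd: float = 0.5):
    """Weighted sum of the two task losses (sketch).
    seg_logits: (B, C, H, W), seg_target: (B, H, W) class indices,
    boundary_logits / boundary_target: (B, 1, H, W)."""
    loss_seg = F.cross_entropy(seg_logits, seg_target)             # semantic segmentation term
    loss_bd = F.binary_cross_entropy_with_logits(boundary_logits,  # boundary detection term
                                                 boundary_target)
    return lambda_seg * loss_seg + lambda_bd * loss_bd

seg_logits = torch.randn(2, 6, 64, 64)
seg_target = torch.randint(0, 6, (2, 64, 64))
bd_logits = torch.randn(2, 1, 64, 64)
bd_target = (torch.rand(2, 1, 64, 64) > 0.9).float()
print(multitask_loss(seg_logits, seg_target, bd_logits, bd_target))
```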
The multitask learning approach tightly integrates the semantic segmentation task and boundary detection task, effectively enhancing the network’s ability to perceive object shapes, contours, and complex boundary information. This significantly improves the accuracy and robustness of semantic segmentation.
3. Results
3.1. Experimental Settings
All experiments in this study were conducted using the PyTorch 2.1.0 framework on a single NVIDIA RTX 4070 Ti Super GPU with 16 GB of memory for both training and testing. All models were trained using the Stochastic Gradient Descent (SGD) algorithm, with images processed into 256 × 256 patches using a sliding window approach. After collecting samples with the sliding window, simple data augmentation, such as random rotation and flipping, is applied to the input data. The learning rate is set to 0.01, the momentum to 0.9, the decay factor to 0.0005, and the batch size to 10.
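For reference, the stated settings translate roughly into the following optimizer configuration; interpreting the decay factor as weight decay is an assumption on our part:

```python
import torch
from torch.optim import SGD

# Placeholder model standing in for FMLSNet.
model = torch.nn.Conv2d(3, 6, kernel_size=3, padding=1)

optimizer = SGD(model.parameters(),
                lr=0.01,            # learning rate reported in the text
                momentum=0.9,       # momentum reported in the text
                weight_decay=5e-4)  # "decay factor" of 0.0005, assumed to be weight decay

batch_size, patch_size = 10, 256    # 256 x 256 sliding-window patches, batch size 10
```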
To evaluate the semantic segmentation performance of multimodal remote sensing data, we adopt Overall Accuracy (OA), mean F1-score (mF1), and mean Intersection over Union (mIoU) as evaluation metrics. These metrics provide an effective and fair comparison between the proposed FMLSNet and other state-of-the-art methods. Specifically, OA measures the overall accuracy of both foreground and background classes, while mF1 and mIoU focus on the five foreground classes, further assessing the model’s segmentation precision for target objects. The mathematical equations of these three evaluation metrics are presented as follows:
where TP represents the number of pixels correctly classified as a specific land cover category, FP denotes the number of pixels incorrectly classified as belonging to that category, TN refers to the number of pixels correctly classified as not belonging to that category, and FN indicates the number of pixels incorrectly classified as not belonging to it. All of these quantities are computed for each land cover class indexed by c.
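In terms of the per-class quantities above, the metrics follow their standard forms (written here by us, consistent with the description in the text):

```latex
\mathrm{OA} = \frac{\sum_{c} \mathrm{TP}_c}{\sum_{c} \left(\mathrm{TP}_c + \mathrm{FN}_c\right)}, \qquad
\mathrm{IoU}_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}, \qquad
\mathrm{F1}_c = \frac{2\,\mathrm{TP}_c}{2\,\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}
```

with mIoU and mF1 obtained by averaging IoU_c and F1_c over the five foreground classes.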
We benchmarked FMLSNet against several typical and state-of-the-art semantic segmentation methods, including ABCNet [52], ESANet [8], MAResUNet [53], PSPNet [12], SA-GATE [40], CMGFNet [39], EIGNet [54], TransUNet [16], CMFNet [32], UNetformer [14], MFTransNet [22], CMTFNet [55], MCSNet [15], and RS3Mamba [47]. In our experiments, single-modality methods were evaluated using only the primary modality, namely VSI. Comparison with these advanced single-modality methods highlights the contribution of the DSM data and, more broadly, the advantages of multimodal over unimodal approaches. The experimental results are presented in Table 1 and Table 2.
3.2. Datasets
As shown in Figure 6a,b, the Vaihingen dataset consists of 16 high-resolution aerial images with an average size of 2500 × 2000 pixels, captured from the Vaihingen region in Germany. The Ground Sampling Distance (GSD) is 9 cm, and the dataset includes near-infrared (NIR), red (R), and green (G) channels, along with DSM data. The dataset contains six classes: impervious surfaces, buildings, low vegetation, trees, cars, and cluttered background. To improve storage and reading efficiency, sliding windows of size 256 × 256 are used during both the training and testing phases, instead of cropping the blocks into smaller images. This results in approximately 960 training images and 320 testing images.
As shown in Figure 6c,d, the Potsdam dataset is significantly larger than the Vaihingen dataset, consisting of 24 high-resolution orthophotos with a size of 6000 × 6000 pixels. It includes four multispectral bands: infrared, red, green, and blue (IRRGB), as well as a normalized DSM with a GSD of 5 cm. For training and testing, we only used the red, green, and blue (RGB) image data. These 24 images are divided into 18 samples for training and 6 samples for testing. After processing with the same sliding window method, the dataset contains 10,368 training samples and 3456 testing samples.
Additionally, the background class labeled as Clutter contains indistinguishable debris and water surfaces. Figure 6 presents visual examples from both datasets. It is important to note that during the training phase, the stride is set to 256, while during the testing phase, the stride is reduced to 32. The smaller stride during testing helps reduce boundary effects by averaging the prediction results from the overlapping regions.
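The overlapping-window inference described here can be sketched as follows; the model is a placeholder, while the 256-pixel patch size and 32-pixel test stride come from the text:

```python
import torch

def sliding_window_predict(model, image, num_classes=6, patch=256, stride=32):
    """Average overlapping patch predictions over a large image (sketch).
    image: (C, H, W); returns per-class scores of shape (num_classes, H, W)."""
    C, H, W = image.shape
    scores = torch.zeros(num_classes, H, W)
    counts = torch.zeros(1, H, W)
    for y in range(0, max(H - patch, 0) + 1, stride):
        for x in range(0, max(W - patch, 0) + 1, stride):
            tile = image[:, y:y + patch, x:x + patch].unsqueeze(0)
            with torch.no_grad():
                pred = model(tile)[0]                      # (num_classes, patch, patch)
            scores[:, y:y + patch, x:x + patch] += pred    # accumulate overlapping predictions
            counts[:, y:y + patch, x:x + patch] += 1
    return scores / counts.clamp(min=1)                    # average over the overlaps

# Toy model standing in for FMLSNet's segmentation output.
model = torch.nn.Conv2d(3, 6, kernel_size=1)
out = sliding_window_predict(model, torch.randn(3, 512, 512))
print(out.shape)                                           # torch.Size([6, 512, 512])
```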
3.3. Experimental Results
Performance Comparison on the Vaihingen Dataset: As shown in Table 1, the proposed FMLSNet achieved the best results in terms of mIoU, mean F1 (mF1), and overall accuracy (OA). Compared to the second-best method, RS3Mamba [47], FMLSNet improved OA by 0.58%, mIoU by 1.39%, and mF1 by 0.84%.
To visually compare the segmentation performance of different algorithms, we present the segmentation results on the Vaihingen dataset in Figure 7. FMLSNet demonstrates superior performance in capturing texture details and handling boundary regions, particularly in areas with complex boundaries such as trees and low vegetation, achieving clearer and more precise segmentation results; this highlights its stronger capability in modeling long-range dependencies. Moreover, in complex scenes with densely packed buildings and interwoven trees, the proposed model accurately detects boundary details, effectively reducing boundary blurring and object omission issues. These results underscore the robustness and adaptability of FMLSNet.
Performance Comparison on the Potsdam Dataset: The experimental results on the Potsdam dataset confirm performance similar to that observed on the Vaihingen dataset. As shown in Table 2, FMLSNet achieves classification accuracies of 97.88%, 88.16%, 96.51%, and 92.96% for buildings, trees, cars, and impervious surfaces, respectively. In particular, the segmentation of buildings, trees, and impervious surfaces stands out. Figure 8 presents a visual comparison of different methods on the Potsdam dataset. It is evident that FMLSNet shows advantages in recognizing objects at different scales, such as buildings, trees, and cars. Specifically, in the two red boxes, the performance of FMLSNet is particularly impressive in complex scenes. In the annotated area, where buildings are obscured by trees, FMLSNet accurately segments the trees and successfully identifies the building contours beneath them, closely resembling real-world ground conditions. Overall, FMLSNet demonstrates superior performance in handling complex objects, significantly improving segmentation accuracy in complex scenes.
Table 1 and Table 2 demonstrate that ABCNet [52], MAResUNet [53], and ESANet [8] utilize local attention mechanisms (LAM) to effectively reduce computational costs while enhancing the extraction of global contextual information. However, these approaches still exhibit limitations in the precision of global information aggregation, particularly in scenarios that require capturing long-range dependencies within complex scenes. PSPNet [12] introduces a pyramid pooling module that integrates global contextual information, yielding high-quality results for scene semantic analysis. Nonetheless, the pyramid pooling strategy may incur additional computational overhead, increasing model complexity. CMGFNet [39], SA-GATE [40], and CMFNet [32] are representative multimodal segmentation methods. CMGFNet employs a cascaded multimodal fusion module to progressively integrate data from different modalities, enhancing segmentation accuracy for details and boundaries, but its step-by-step processing framework may lead to computational delays in complex scenarios. SA-GATE uses adaptive attention mechanisms to guide modality fusion and reduce inter-modal conflicts, yet its attention mechanism could require further optimization in cases of significant modality discrepancies. CMFNet applies cross-attention mechanisms to fuse features across multiple scales, improving performance in complex scenes, but its intricate model architecture may pose efficiency challenges when handling UHR remote sensing imagery. Transformer-based segmentation methods, such as TransUNet [16] and UNetFormer [14], leverage efficient Transformer mechanisms to extract global contextual information, significantly enhancing their capacity to understand complex scenes. Despite their effectiveness in large-scale segmentation tasks, these methods may introduce pixel-level semantic overlaps between objects due to the implicit nature of global contextual information acquisition, potentially compromising boundary localization accuracy. CMTFNet [55] utilizes a multiscale Transformer module to extract multimodal global context information at different scales, which reduces model complexity while retaining some of the global features; however, reducing model complexity may sacrifice a certain degree of detailed information, causing it to underperform in fine-grained segmentation tasks. MCSNet [15] and EIGNet [54] adopt specialized designs to improve object boundary smoothness, but they remain susceptible to false boundaries in complex scenarios, which can result in incomplete segmentation. RS3Mamba [47] effectively exploits the advantages of the Mamba structure, capturing global features efficiently while reducing the computational complexity of attention mechanisms from quadratic to linear. Nevertheless, its performance may benefit from further optimization when addressing highly complex or non-linear modality distributions, particularly in terms of multimodal adaptability, which still warrants deeper investigation.
As illustrated in Figure 9, our method produces edge predictions on the Vaihingen and Potsdam datasets that closely align with the ground-truth boundaries, accurately delineating complex object contours. Our predicted edges exhibit greater continuity and smoothness while significantly reducing intraclass noise. This improvement stems from the boundary-aware modeling in the LSSD module at shallow stages and the joint supervision of semantic and boundary features via our multitask learning strategy, enabling robust boundary segmentation even in complex multimodal scenarios.
Compared with the above baselines, FMLSNet demonstrates the best performance, which can be attributed to two innovative designs: (1) By extracting rich multi-directional boundary features, the multimodal shallow features are guided during deep learning under different receptive fields, thus capturing shallow spatial detail information in a more comprehensive way. (2) A novel fusion Mamba module is designed, which incorporates an innovative scanning mechanism combined with disentanglement learning to fully explore and learn more representative cross-modal feature representations. This design further enhances the expression of global semantic information, enabling more accurate feature modeling and segmentation performance in complex remote sensing scenarios.
3.4. Ablation Study
To validate the effectiveness of each module in the proposed FMLSNet, we conducted ablation experiments by sequentially removing core components to analyze their impact on overall performance. The experimental results are shown in Table 3. We used a dual-branch framework as the baseline and systematically added the LSSD, FMB, and BAD modules to investigate their contributions to segmentation performance.
In the first experiment, we replaced the proposed FMB module with two independent single-modality Mamba branches, keeping only the LSSD module in the ResBlocks. Although the overall accuracy (OA) reached 91.93%, it decreased compared to the complete model. This indicates that the FMB module, by integrating cross-modal enhancement with disentanglement learning, improves the model's ability to capture global context information and inter-modal interactions during deep feature fusion. In the second experiment, we removed the LSSD module from both branches to evaluate its impact on shallow feature learning. OA decreased further to 91.62%, validating the ability of the LSSD module to capture multidirectional boundary features through large strip convolutions and to compensate for the lack of high-level detail information. Additionally, it effectively extracted rich shallow detail features during shallow feature fusion, laying the foundation for the overall performance improvement. In the third experiment, we added multitask learning on top of the first experiment by introducing a boundary-aware decoder (BAD) for the boundary detection task. OA increased to 92.07%, indicating that the boundary detection task effectively optimizes the model's perception of boundary regions and significantly enhances the completeness and clarity of the segmentation results, thereby providing stronger support for detailed expression in complex scenes.
To more clearly demonstrate the effectiveness of the core method, we present heatmaps generated by FMLSNet in Figure 10b–d. Each subfigure in Figure 10 is labeled to clearly indicate its contents. Specifically, the three heatmaps of FMLSNet are extracted after the LSSD module, after the FMB module, and just before the semantic segmentation head, respectively. These heatmaps correspond to feature maps generated at different stages, showcasing the model's ability to distinguish between category pixels such as buildings, roads, trees, and cars. Figure 10 shows that FMLSNet exhibits more high-activation regions across all samples, indicating that FMLSNet can extract richer semantic information at the global scale through its multilevel multimodal fusion strategy, providing strong support for the integration of multimodal features. Furthermore, the boundaries of the highly activated regions are more closely aligned with the actual contours of the objects. This is due to our proposed hybrid large strip method, which refines multi-directional boundary features and further guides the learning of shallow spatial features under multiscale receptive fields, compensating for the loss of high-level detail information and thereby more accurately capturing and integrating multimodal shallow features. The carefully designed Mamba cross-modal scanning mechanism not only enhances the ability to capture deep features but also strengthens the expression and interaction of cross-modal information by effectively modeling the long-range dependencies between modalities, providing more comprehensive and precise support for feature integration in complex scenarios. The results in Figure 10 further validate the advantages of our proposed multilevel fusion approach in multimodal information extraction and fusion, thus improving the overall performance of semantic segmentation.
DSM analysis of experimental comparisons: As shown in Table 4, FMLSNet significantly improves the classification accuracy of most categories across both datasets by effectively integrating additional DSM data. In particular, the improvement is especially significant for buildings and impervious surfaces, two classes that are usually characterized by stable surface heights, providing the model with a clear basis for discrimination and improving the classification accuracy. Furthermore, since cars are usually located on roads, the relatively consistent height of road features also aids in improving the boundary identification accuracy of cars. FMLSNet demonstrates significant performance gains on both datasets, driving the overall optimization of the semantic segmentation task.
Next, we validate the effectiveness of the LSSD module through an ablation study on different large strip kernel size configurations in Table 5. The first column lists the five groups of strip kernel sizes used in the experiments. The results indicate that smaller kernels struggle to capture long-range dependencies, leading to insufficient global context modeling and degraded performance. In contrast, larger kernels encompass broader contextual information and enhance boundary and semantic feature representation. Notably, our proposed progressive expansion strategy, which gradually increases the strip kernel size as the network depth grows, achieves the best trade-off between preserving fine-grained details and modeling global context, thereby delivering superior segmentation performance.
3.5. Hyperparameter Selection
To evaluate the overall impact of the two tasks on the semantic segmentation results, we conducted comprehensive experiments by adjusting the weighting coefficients λ1 and λ2 in Equation (17). The experiments are performed on the Vaihingen dataset, as it exhibits trends similar to those observed on the Potsdam dataset. As shown in Figure 11, we systematically adjusted the ratio between λ1 and λ2. Notably, when both coefficients are set to 0.5, the best results are achieved, indicating that spatial boundary information plays a critical role in the semantic segmentation task. However, if either task is assigned an excessively high weight, the dominant task during training can lead to an information bias toward that task, potentially causing the network to overlook the contributions of the other task. Therefore, we select 0.5 as the value of both λ1 and λ2 in all experiments.
3.6. Model Parameters and Computation Complexity Analysis
Based on the analysis in Table 6, we evaluated the computational complexity of FMLSNet, focusing primarily on the floating-point operation count (FLOPs), the number of model parameters, and the segmentation accuracy (mIoU). Ideally, an efficient model should minimize FLOPs and parameter count while achieving excellent segmentation accuracy. Table 6 shows that FMLSNet achieves a segmentation accuracy of 84.29% in terms of mIoU, the best among all methods, significantly outperforming traditional single-modal and multimodal approaches such as CMTFNet, CMFNet, and MAResUNet. This is because FMLSNet employs large strip convolutions for shallow feature extraction and uses Mamba for deep global receptive field modeling, resulting in an efficient fusion of multimodal data and accurate capture of global contextual information. These designs give FMLSNet robust feature representation capabilities and computational efficiency when handling complex multimodal data. Furthermore, compared to lightweight models such as UNetFormer, MFTransNet, and CMGFNet, FMLSNet achieves significant improvements in segmentation performance, while its complexity remains substantially lower than that of more complex Transformer-based architectures, demonstrating a balanced trade-off between computational efficiency and accuracy. Overall, FMLSNet achieves high-precision remote sensing semantic segmentation while maintaining low computational complexity; although it introduces moderate computational overhead, the performance gain justifies its deployment in tasks where segmentation precision is critical.
4. Discussion
This study presents FMLSNet, A Multilevel Multimodal Hybrid Mamba-Large Strip Convolution Network for remote sensing semantic segmentation that integrates LSSD with FMB to enhance shallow boundary feature extraction and deep multimodal semantic interaction. Extensive ablation and comparative experiments on the Vaihingen and Potsdam datasets demonstrate that FMLSNet achieves superior boundary delineation and overall segmentation accuracy, while maintaining strong generalization capability.
In terms of the performance–efficiency trade-off, FMLSNet introduces a moderately higher computational cost than CNN-based approaches but remains substantially more efficient than Transformer-based methods. Its progressive strip kernel design and linear Mamba scanning strategy preserve boundary precision while avoiding the quadratic complexity of attention mechanisms. Compared with existing Mamba-based scanning mechanisms, FMLSNet employs a cross-modal 2D scanning strategy that simultaneously models horizontal and vertical spatial dependencies and leverages disentangled representation learning to reduce modality interference, resulting in more effective multimodal feature alignment.
Nevertheless, some limitations remain: the use of large kernels and multistage fusion increases parameter count and memory footprint, and performance may fluctuate when applied to tasks with heterogeneous annotation standards or unbalanced modalities. Future work will explore expanding the framework to additional modalities (e.g., SAR, LiDAR, hyperspectral) and related dense prediction tasks, while incorporating dynamic large strip kernel adaptation based on scene complexity to achieve a better balance between efficiency and accuracy.
5. Conclusions
In this study, we propose a novel boundary-enhanced multi-level multimodal fusion Mamba–Large Strip Convolution Network (FMLSNet), specifically designed for semantic segmentation of UHR remote sensing images. The model follows an encoder–decoder architecture and incorporates two key components: the Mamba-based multimodal fusion framework (FMB) and the Large Strip Spatial Detail extraction module (LSSD). The FMB module employs an innovative multimodal scanning mechanism combined with disentangled representation learning to extract and fuse deep features and cross-modal contextual information from multimodal remote sensing data, effectively enhancing the modeling of global semantic information. The LSSD module adaptively integrates multi-directional large strip convolutions to precisely extract fine-grained boundary features, strengthening shallow spatial detail learning and further improving segmentation accuracy and robustness.
To further enhance segmentation precision, we design a multitask learning strategy that simultaneously addresses semantic segmentation and boundary detail detection, optimizing boundary localization accuracy. Extensive experiments on two challenging remote sensing datasets, Potsdam and Vaihingen, demonstrate that FMLSNet significantly outperforms state-of-the-art methods on both datasets. Ablation studies validate the critical contributions of the FMB and LSSD modules, further proving the effectiveness of our approach. Comprehensive evaluation results show that FMLSNet not only surpasses existing techniques in segmentation accuracy but also exhibits good robustness when handling complex remote sensing image scenarios.
In summary, the FMLSNet model serves as an effective tool for semantic segmentation of multimodal remote sensing images by incorporating multistage fusion and boundary detail enhancement mechanisms, significantly improving segmentation precision and deep feature capture capabilities. In future work, we plan to further optimize the model’s computational complexity and enhance boundary detail extraction to better meet the demands of remote sensing image segmentation.