Article

Remote Sensing Image Segmentation Using Vision Mamba and Multi-Scale Multi-Frequency Feature Fusion

1 School of Electronics and Information Engineering, Anhui University, Hefei 230601, China
2 School of Electronics and Communication Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(8), 1390; https://doi.org/10.3390/rs17081390
Submission received: 3 March 2025 / Revised: 6 April 2025 / Accepted: 11 April 2025 / Published: 14 April 2025
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Rapid advancements in remote sensing (RS) imaging technology have heightened the demand for the precise and efficient interpretation of large-scale, high-resolution RS images. Although segmentation algorithms based on convolutional neural networks (CNNs) or Transformers have achieved significant performance improvements, the trade-off between segmentation precision and computational complexity remains a key limitation for practical applications. Therefore, this paper proposes CVMH-UNet—a hybrid semantic segmentation network that integrates the Vision Mamba (VMamba) framework with multi-scale feature fusion—to achieve high-precision and relatively efficient RS image segmentation. CVMH-UNet comprises the following two core modules: the hybrid visual state space block (HVSSBlock) and the multi-frequency multi-scale feature fusion block (MFMSBlock). The HVSSBlock integrates convolutional branches to enhance local feature extraction while employing a cross 2D scanning method (CS2D) to capture global information from multiple directions, enabling the synergistic modeling of global and local features. The MFMSBlock introduces multi-frequency information via 2D Discrete Cosine Transform (2D DCT) and extracts multi-scale local details through point-wise convolution, thereby optimizing refined feature fusion in skip connections between the encoder and decoder. Experimental results on benchmark RS datasets demonstrate that CVMH-UNet achieves state-of-the-art segmentation accuracy with optimal computational efficiency, surpassing existing advanced methods.

1. Introduction

Semantic segmentation, a key research area in interpreting remote sensing (RS) images, is extensively utilized in tasks including urban development projects [1,2,3], topographic mapping [4,5,6,7,8], strategic land resource allocation [9], road network extraction [10], and environmental surveillance [11,12]. With the continuous improvement in imaging resolution and the increasing width of imaging swaths, the volume and complexity of the acquired RS images significantly increase. This imposes higher demands on segmentation algorithms to more comprehensively capture target information and improve interpretation efficiency. Traditional segmentation algorithms, including Support Vector Machines (SVMs) [13], Random Forests (RF) [14], and Conditional Random Fields (CRFs) [15], exhibit restricted capacity in target information extraction when processing high-definition RS images, making it challenging to ensure segmentation accuracy. Furthermore, these algorithms struggle to meet the demands of real-time image interpretation, as their computational efficiency often falls short of practical requirements.
Following the ascendancy of neural networks, data-driven deep learning methods automatically learn and extract more robust and discriminative features of ground objects from RS images. These methods exhibit notable efficacy in the domain of RS image segmentation, surpassing many traditional methods in both accuracy and efficiency. In particular, UNet [16], which uses a convolutional neural network (CNN) as its foundational architecture, gains widespread recognition and application within the domain of RS image segmentation. This success can be attributed to its unique U-shaped encoder–decoder structure and skip connection design. A host of subsequent studies on deep segmentation algorithms improve upon the distinctive U-shaped architecture of UNet to capture richer information and enhance segmentation accuracy. With the rise of Transformer models, their advantages in global information modeling are fully explored and applied to the field of image segmentation [17,18,19,20]. Segmentation algorithms based on the Transformer framework effectively address the insufficient global information extraction caused by the limited receptive field of CNNs, thereby further improving segmentation accuracy. Swin-UNet [21], which combines the Swin Transformer with a U-shaped architecture, is the first purely Transformer-based U-shaped model. However, the self-attention mechanism in the Transformer architecture results in excessively high computational complexity, making it challenging for the model's computational performance to meet the demands of practical applications. To address this issue, researchers selectively incorporate Transformer architectures into either the encoder or decoder to achieve a balance between model accuracy and computational complexity. UNetFormer [22] leverages the combination of a CNN-based encoder and a Transformer-based decoder to adeptly capture global and local contexts. While this approach effectively avoids excessive computational complexity, the CNN-based encoder hinders the model from achieving higher accuracy levels. The limitations of these models motivate us to develop a novel framework that efficiently captures global context while maintaining linear computational complexity.
Recently, the Mamba framework [23], which is grounded in the State Space Model (SSM), has become a promising alternative due to its ability to capture distant dependencies while maintaining linear computational complexity. Following this, Vim [24] and VMamba [25] further extend the advantages of SSM to the discipline of image processing, providing new momentum for technological advancements in this area. Particularly within the domain of RS image analysis, researchers deeply explore the application potential of SSM. For example, research on Rs3Mamba [26], RTMamba [27], and CM-UNet [28] proves that VMamba can serve as a viable substitute for conventional CNNs and Transformers in segmenting RS images. These studies show that VMamba not only captures global contextual information but also maintains lower computational complexity, introducing a contemporary and efficient solution for the partitioning of RS images. Although these methods take advantage of the strengths of VMamba in global information extraction to some extent, they often overlook the limitations in local information extraction that may arise due to specific scanning directions. Therefore, there is a requirement for further investigation to improve the structure of VMamba, enhancing its ability to extract local detail information while maintaining its efficiency in capturing global information. This improvement will help formulate advanced semantic segmentation methods that are both precise and efficient for RS images.
In addition to addressing the balance between model accuracy and complexity by applying VMamba in the encoding–decoding architecture, it is also crucial to resolve the key issue of feature fusion in traditional skip connections. This resolution will further improve the model’s segmentation performance. Conventional skip connections merge features by linking low-level encoder features directly to high-level decoder features. However, this simple concatenation method fails to fully exploit the complementarity between features at different levels. This limitation restricts the ability of the network to differentiate between nuanced low-level information and high-level semantic aspects. As a result, the features become coarsely integrated, causing the loss of minor object details. To resolve these concerns, researchers increasingly focus on the significance of attention mechanisms and multi-scale features in semantic segmentation. The AFF module [29], Multiattention Network (MANet) [30], AFENet [31], and TDBAN [32] demonstrate that multi-scale feature fusion strategies based on attention mechanisms effectively address the issues in skip connections. However, ECANet [33] points out that the dimensionality reduction in fully connected layers within attention mechanisms leads to information loss. Additionally, FCANet [34] shows that the traditional global average pooling method causes significant information loss. Therefore, overcoming issues in attention mechanisms and designing a feature fusion strategy that enhances the completeness and transfer precision of data is crucial. This strategy aims to achieve sophisticated merging of intricate RS images and enhance the accurate segmentation of minor objects, which represents a key challenge in the current research on RS semantic segmentation methods.
To tackle the previously mentioned difficulties, this paper proposes a novel hybrid semantic segmentation network based on VMamba for RS image segmentation, called CVMH-UNet. The core of CVMH-UNet lies in the hybrid visual state space block (HVSSBlock) and the multi-frequency multi-scale feature fusion block (MFMSBlock). The HVSSBlock is an enhanced version of the VSSBlock, incorporating three key improvements as follows: (1) Integration of convolutional branches: It addresses the insufficient extraction of local information in the VSSBlock by integrating convolutional branches, enabling the comprehensive extraction of both global and local features. (2) Optimized scanning mechanism: The HVSSBlock replaces the SS2D scanning method (horizontal, vertical, and reverse paths) with the cross 2D scanning method (CS2D), as shown in Figure 1, which includes horizontal, vertical, diagonal, and anti-diagonal paths. This allows for more comprehensive global information capture from multiple directions, thereby overcoming the limitations of the SS2D method. (3) Residual connections: Residual connections are used to optimize the feature extraction process, minimizing information loss and enhancing overall efficiency. In addition, to enhance the integrity and transmission precision of information in feature fusion, this paper proposes a multi-frequency multi-scale feature fusion block (MFMSBlock). This block employs a structure with two branches to acquire information at varying scales. Within the global pathway, channel attention is treated as a compression problem, utilizing 2D discrete cosine transform (2D DCT) to compress the channels. By introducing multiple frequency components, it provides global feature channel attention to enrich the representation of the channels and enhance information integrity. At the same time, it directly calculates channel weights using one-dimensional convolution with adaptive kernel sizes (Adaptive 1D Conv). Avoiding the information loss that could be caused by the dimensionality reduction in fully connected layers, this method ensures the accuracy of information transmission. In the local branch, point-wise convolution is used to provide local feature channel attention at a different scale from the global branch, capturing complex details and more extensive structural information. The two branches aggregate multi-scale information along the channel dimension, ensuring the effective utilization of features at all levels. This allows both lower-level and higher-level features to be fully utilized and complementary in the fusion process, achieving refined feature fusion. The primary contributions of this research are outlined as follows:
  • A VMamba-based feature extraction module, named HVSSBlock, is proposed. This module enhances local feature extraction through an integrated convolutional branch while simultaneously strengthening global feature acquisition via an optimized scanning strategy. Residual connections are incorporated to refine the feature extraction process and minimize information loss. The combined use of these strategies enables HVSSBlock to achieve comprehensive global-local information capture with lower computational complexity.
  • The MFMSBlock, a feature fusion module, is proposed to serve as a replacement for traditional skip connections. This module introduces multi-frequency information through 2D DCT while employing Adaptive 1D Conv to mitigate information loss and enhance transmission accuracy. Simultaneously, it extracts multi-scale local details via point-wise convolution, ultimately achieving refined feature fusion between the encoder and decoder.
  • An efficient U-shaped architecture network, named CVMH-UNet, is proposed based on HVSSBlock and MFMSBlock. Extensive experiments on multiple public RS datasets demonstrate that this method achieves superior segmentation accuracy while maintaining lower computational complexity.
The subsequent pages of this study are outlined in the following order. Section 2 introduces the related work surveyed in this study. Section 3 details the architecture of the proposed approach and the particular setup of each component. Section 4 presents the experimental setup and technical metrics. Section 5 affirms the superiority of the proposed method employing three classic datasets. Finally, Section 6 provides a summary of the research findings and insights from the study.

2. Related Work

2.1. Vision State Space Models

Recently, SSMs have become a research hotspot. As the current leading SSM, Mamba not only excels in long-range modeling but also exhibits linear complexity with respect to input size, addressing the computational efficiency issues of Transformers in modeling long sequences. In the field of visual research, Vim [24] proposes a pure vision backbone model based on SSM, introducing Mamba into the visual domain for the first time. VMamba [25] enhances visual processing capabilities by introducing the Cross-Scan module. This module enables the model to selectively scan images in two dimensions and demonstrates superiority in image classification tasks. LocalMamba [35] focuses on a windowed scanning strategy for visual state space models, optimizing visual information to capture local dependencies and introducing a dynamic scanning method to search for optimal choices at different layers. In downstream visual tasks, Mamba also sees extensive use in RS image segmentation research. Rs3Mamba [26] proposes using the VSSBlock to provide additional global information, aiding the convolution-centered main branch in feature extraction, thereby enhancing the global awareness of the model while preserving local details. In CM-UNet [28], a CNN encoder is tasked with acquiring detailed image features, and a Mamba decoder is in charge of combining global information. The inclusion of an MSAA module enables the integration of features at various scales, effectively capturing long-distance dependencies and multi-scale global context information in high-resolution RS images. MambaHSI [36] proposes a novel Mamba-based model for HSI classification, which captures long-range dependencies across the entire image while adaptively fusing spatial–spectral information. Although these methods utilize the strong ability of VMamba to model long-range dependencies with linear computational complexity, they overlook the limitations in local information extraction that may arise from specific scanning directions.

2.2. Attention Mechanisms in Deep Learning

SENet [37] introduces the channel attention mechanism, which performs global average pooling (GAP) along the channel dimension and uses fully connected layers to compute the weight of each channel. Owing to its significant performance improvements, attention modules have since attracted widespread interest. Subsequent studies, such as CBAM [38] and scSE [39], use 2D convolutions with a kernel size of k × k to calculate spatial attention and integrate it with channel attention. SRM [40] suggests combining GAP with global standard deviation pooling. Attention mechanisms are also widely used in semantic segmentation tasks. MsASNet [41] designs an encoder–decoder structure to enhance landslide boundaries and combines channel attention mechanisms to boost feature extraction capabilities. AAFormer [42] proposes a novel multi-head attention participation module (AAM). This module highlights informative context by considering the correlation between self-attention maps and query vectors. CAGNet [43] designs a category attention guidance module that combines Transformer and CNN, integrating the proposed category attention into both the global scoring function and the local category feature weights. DeepLabv3+ [44] applies the channel attention mechanism to low-level features, reducing computational complexity and enhancing the clarity of object boundaries, and introduces polarized self-attention after the ASPP module to improve the spatial feature representation of feature maps. Although these attention mechanisms perform well in practice, the dimensionality reduction in the fully connected layers following GAP negatively impacts the attention mechanism. Therefore, ECANet [33] proposes using an Adaptive 1D Conv to replace the fully connected layer, avoiding the negative effects of dimensionality reduction and capturing local cross-channel interactions. Additionally, a pivotal aspect of channel attention mechanisms is the calculation of a scalar value for each channel. Due to its uncomplicated design, GAP struggles to effectively extract complex input features. FCANet [34] demonstrates that GAP is a special case of the Discrete Cosine Transform (DCT) and explores different frequency component combinations within a multi-spectral framework, which enhances the ability of GAP to capture complex information. These improved attention mechanisms, by more precisely handling inter-channel and intra-channel interactions, help models better understand and utilize input data features. Therefore, effectively applying these methods in feature fusion strategies becomes key to achieving refined feature fusion.

2.3. Skip Connections in Deep Learning

In U-shaped networks, long skip connections are an important component. These connections enable the network to capture detailed semantic information by integrating fine-grained details from shallow, high-resolution layers with higher-level semantic features from deep layers. Although skip connections are widely used to combine features from various paths, the fusion of these connected features is typically achieved through addition or concatenation. This approach ignores the differences between the features and assigns the same weight to all of them, preventing the complementary nature of features at different levels from being fully exploited. As a result, the capability of the network to recognize both minor details and advanced semantic features is limited. With the widespread application of attention mechanisms, SKNet [45] introduces an attention-based, non-linear method for specialized feature fusion. However, its constraints are apparent in its focus only on soft feature selection within the same layer, overlooking the challenge of integrating features across layers in skip connections. Moreover, the scale variation of objects represents a significant challenge within the realm of computer vision. To mitigate the issues arising from scale variation and small objects, as well as to enhance the effectiveness of skip connections, the AFF module [29] proposes a unified and generalized method for feature fusion using attention mechanisms. In semantic segmentation, the Multiattention Network (MANet) [30] extends attention mechanisms across multiple scales, aggregating contextual information at different scales to obtain a more complete and nuanced merging of features. MBT-UNet [46] introduces an efficient Feature Fusion Module (FFM) to optimize the collaboration and integration of features across different scales. SABNet [47] proposes local embedding and coordinate attention fusion modules during feature fusion. These modules reduce attention dispersion and efficiently integrate low- and high-level features. SFCRNet [48] proposes an SFN with a step-shaped architecture and the corresponding fusion modules. This design ensures that rich semantic information from high-level features is continuously transmitted to the low-level layers. Although these feature fusion methods can effectively improve the effectiveness of long skip connections, they often overlook the issues of information loss during transmission and the inadequate utilization of information in attention mechanisms, as highlighted in ECANet [33] and FCANet [34].

3. Methodology

3.1. Overall Architecture

The complete network structure of the proposed CVMH-UNet is depicted in Figure 2a; it is designed and improved based on VMamba. The CVMH-UNet architecture primarily comprises an embedding block, an encoder, a decoder, four MFMSBlocks, and a final projection layer. The original RS image $x \in \mathbb{R}^{3 \times H \times W}$ is segmented into non-intersecting 4 × 4 patches in the embedding block and mapped to a C-dimensional feature space. The normalized features yield an embedded feature map $x \in \mathbb{R}^{C \times \frac{H}{4} \times \frac{W}{4}}$ with the channel dimension C fixed at 96. These features are then fed into an HVSSBlock-based encoder for hierarchical feature extraction. In the first three stages, patch merging is applied to progressively reduce the spatial dimensions and extract multi-scale hierarchical features. The decoder, constructed with HVSSBlocks, progressively restores feature resolution through patch-expanding operations in its final three stages. In these stages, the upsampling layers not only recover spatial details but also refine high-level semantic representations. The final projection layer generates the segmentation result, aligning the spatial dimensions of the output with the original input resolution. Additionally, to better restore details and capture more target information, the decoder interfaces with the encoder through the MFMSBlock for feature fusion. This approach ensures strong information integrity and precision in information transmission, strengthens the discriminative capacity of the features, and refines the detailed fusion of small objects. As a result, the segmentation outcomes are more accurate and precise.
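For concreteness, the following is a minimal PyTorch sketch of the embedding block as described above (non-overlapping 4 × 4 patches projected to C = 96 and normalized); the module and variable names are illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Minimal sketch of the embedding block: split the image into
    non-overlapping 4x4 patches and project them to C channels."""
    def __init__(self, in_channels: int = 3, embed_dim: int = 96):
        super().__init__()
        # A stride-4 convolution with a 4x4 kernel is equivalent to flattening
        # non-overlapping 4x4 patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=4, stride=4)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                 # (B, C, H/4, W/4)
        x = x.permute(0, 2, 3, 1)        # channel-last for LayerNorm
        x = self.norm(x)
        return x.permute(0, 3, 1, 2)     # back to (B, C, H/4, W/4)

# Example: a 3 x 256 x 256 image becomes a 96 x 64 x 64 feature map.
feat = PatchEmbedding()(torch.randn(1, 3, 256, 256))
print(feat.shape)  # torch.Size([1, 96, 64, 64])
```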

3.2. HVSSBlock

The CVMH-UNet encoder–decoder architecture employs the proposed HVSSBlock as its core component, derived from the VSSBlock with structural enhancements. The advantage of this module lies in its higher feature representation capability, achieved through the integration of convolutional branches and an optimized scanning method. Inspired by VMamba [25], and to offset the deficiency in local information extraction resulting from the SS2D scanning method in the VSSBlock, we design a two-branch structure in the HVSSLayer by adding a convolutional branch to capture local information. Additionally, to address the inefficient feature capture caused by bidirectional scanning along identical paths in the SS2D method of VMamba, we optimize this approach by using the CS2D method with intersecting scanning path orientations. This allows the extraction of spatial features from multiple directions without increasing computational complexity, making it more efficient at capturing global information from high-resolution RS images.
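To make the CS2D description concrete, the following illustrative sketch generates the four scanning orders (horizontal, vertical, diagonal, and anti-diagonal) as index permutations of a flattened H × W grid. This is our reading of the scanning paths in Figure 1, not the authors' code, and the selective-scan operator itself is omitted:

```python
import torch

def cs2d_scan_orders(H: int, W: int):
    """Illustrative index orders for the four CS2D scanning paths
    (horizontal, vertical, diagonal, anti-diagonal) over an H x W grid.
    Each order is a permutation of the flattened H*W positions."""
    idx = torch.arange(H * W).reshape(H, W)
    horizontal = idx.flatten()        # row-major raster scan
    vertical = idx.t().flatten()      # column-major raster scan
    offsets = range(-(H - 1), W)
    # Diagonal scan: lines where (row - col) is constant (top-left to bottom-right).
    diagonal = torch.cat([torch.diagonal(idx, offset=d) for d in offsets])
    # Anti-diagonal scan: lines where (row + col) is constant.
    anti_diagonal = torch.cat(
        [torch.diagonal(torch.fliplr(idx), offset=d) for d in offsets])
    return horizontal, vertical, diagonal, anti_diagonal

# Example on a 4 x 4 grid: a flattened feature sequence can be gathered along
# each path with torch.index_select before the selective-scan step.
for order in cs2d_scan_orders(4, 4):
    print(order.tolist())
```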
As can be seen in Figure 2d, two branches are applied to the input feature $F_{in}$ to obtain global and local information. More specifically, within the global pathway, the input features first go through the Cross Mamba module (CM) with the CS2D scanning method to extract global information, and then the global information is fed into the channel attention (CA) mechanism to obtain the global feature $F_g$. This mechanism leverages the interdependencies among channel mappings to enhance the representation of specific semantic features. In the CM module, as shown in Figure 3b, after layer normalization, the input $F_{in}$ is channeled into two distinct branches. The initial branch includes a linear layer and an activation step. The other branch encompasses a linear layer, a depthwise separable convolution, and an activation step, before entering the CS2D module to extract global features across various directions. After this, the features undergo normalization and are subsequently merged element-wise with the output of the initial branch. Finally, a linear layer blends the features, which are then integrated with the residual connection to produce the final output of the CM module. In the local branch, the input features first pass through a 3 × 3 convolution to extract local information, which is then enhanced by the spatial attention (SA) mechanism to emphasize local details and suppress irrelevant regions, resulting in the local feature $F_l$. To sum up, the process can be represented as follows:
$$F_g = \mathrm{CA}(\mathrm{CM}(F_{in}))$$
$$F_l = \mathrm{SA}(\mathrm{Conv}_{3 \times 3}(F_{in}))$$
where $\mathrm{Conv}_{3 \times 3}(\cdot)$ represents a 3 × 3 convolution, $\mathrm{CM}(\cdot)$ represents the Cross Mamba module, and $\mathrm{CA}(\cdot)$ and $\mathrm{SA}(\cdot)$ represent the channel attention mechanism and spatial attention mechanism, respectively. Afterward, the global and local features are added together, followed by a depthwise separable convolution, layer normalization, and a 1 × 1 convolution to refine the combined features. These refined features are then merged with the original input $F_{in}$ through a residual connection, generating the fused feature $F_u$. Finally, to enhance feature representation, the E-FFN module is introduced to improve information interaction between different channels, achieving better segmentation performance. The E-FFN module replaces the FC layers in the FFN [49] with 1 × 1 convolutions and inserts two parallel 3 × 3 and 5 × 5 depthwise separable convolutions between them. This structural redesign endows the E-FFN with intrinsic convolutional feature extraction capabilities, making it particularly effective for processing 2D image data. The depthwise separable convolution [50] first applies a depthwise convolution independently to each input channel and then performs channel-wise linear combinations through point-wise convolution. This dual-stage design strengthens cross-channel information interaction while maintaining computational efficiency. Furthermore, ReLU activation functions are added after each depthwise separable convolution layer, injecting non-linear factors that enable the model to capture more complex relationships and enhance representational capacity. This process can be represented as follows:
$$F_u = F_{in} + \mathrm{Conv}_{1 \times 1}(\mathrm{LN}(\mathrm{dwConv}_{3 \times 3}(F_g + F_l)))$$
$$F_{out} = F_u + \mathrm{EN}(F_u)$$
where $F_{out}$ denotes the output features of the HVSSLayer, $\mathrm{dwConv}_{3 \times 3}(\cdot)$ denotes the 3 × 3 depthwise separable convolution, $\mathrm{LN}(\cdot)$ denotes layer normalization, $\mathrm{Conv}_{1 \times 1}(\cdot)$ denotes the 1 × 1 convolution, and $\mathrm{EN}(\cdot)$ denotes the E-FFN module.
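As an illustration, the following is a minimal sketch of the E-FFN as described above (1 × 1 convolutions in place of the FC layers, with parallel 3 × 3 and 5 × 5 depthwise separable convolutions, each followed by ReLU). The expansion ratio of the hidden dimension is an assumption, and the exact layer ordering may differ from the authors' implementation:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution followed by a point-wise (1x1) convolution."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class EFFN(nn.Module):
    """Sketch of the E-FFN: 1x1 convolutions in place of the FC layers, with
    two parallel 3x3 and 5x5 depthwise separable convolutions (each followed
    by ReLU) inserted between them."""
    def __init__(self, channels: int, expansion: int = 4):  # expansion ratio assumed
        super().__init__()
        hidden = channels * expansion
        self.fc_in = nn.Conv2d(channels, hidden, kernel_size=1)
        self.dw3 = DepthwiseSeparableConv(hidden, 3)
        self.dw5 = DepthwiseSeparableConv(hidden, 5)
        self.act = nn.ReLU(inplace=True)
        self.fc_out = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x):
        h = self.fc_in(x)
        h = self.act(self.dw3(h)) + self.act(self.dw5(h))  # parallel branches
        return self.fc_out(h)

# Inside an HVSSLayer the module is used residually: F_out = F_u + EFFN(F_u).
```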
Additionally, to optimize the feature extraction process and reduce information loss, enabling the network to better extract features at each layer, two HVSSLayers are connected through a residual connection to form an HVSSBlock. As shown in Figure 2b, this approach reduces information loss, helping the network better identify and segment targets in high-resolution RS images.

3.3. MFMSBlock

The encoder and decoder of CVMH-UNet are capable of extracting low-level features with rich texture details and high-level features with semantic information, respectively. To utilize this information more effectively, the MFMSBlock is proposed. The fundamental principle involves incorporating frequency information into global pooling while integrating further local scale analysis, thereby developing a multi-frequency, multi-scale feature fusion mechanism. This module employs a dual-branch structure to capture information across different scales. Within the global branch, 2D DCT is employed to integrate frequency components into global pooling. Multiple pooling strategies are then applied to extract multi-band spectral features, thereby improving the comprehensiveness of feature representation. Furthermore, Adaptive 1D Convolution [33] is leveraged to derive channel attention weights directly, avoiding the information loss that could be caused by the dimensionality reduction in fully connected layers. This design both enhances information integrity and improves the precision of information transmission. In the local branch, point-wise convolution is used to provide local feature channel attention at another scale, capturing complex details and broader structural information. Finally, the two branches aggregate information from different scales along the channel dimension. The fusion approach not only strengthens the capacity of the model to detect local texture details but also enhances its comprehension of global semantic frameworks, substantially elevating the overall efficacy of the model.
The overall structure of the proposed MFMSBlock is depicted in Figure 2c. The feature map $X_i$, obtained by adding the low-level feature map $F_i$ from the i-th layer of the encoder and the high-level semantic feature map $\tilde{F}_i$ from the i-th layer of the decoder, passes through the multi-frequency multi-scale attention mechanism (MFMS-AM) to generate the fusion weights $X_i'$ and $1 - X_i'$. This allows the MFMSBlock to conduct a soft selection, or weighted averaging, between $F_i$ and $\tilde{F}_i$. This module can be expressed as follows:
$$Z_i = \mathrm{MA}(F_i + \tilde{F}_i) \times F_i + (1 - \mathrm{MA}(F_i + \tilde{F}_i)) \times \tilde{F}_i$$
where $Z_i \in \mathbb{R}^{C \times H \times W}$ represents the fused features, and $\mathrm{MA}(\cdot)$ represents the attention weight values generated by MFMS-AM.
The overall structure of MFMS-AM is shown in Figure 4. It uses 2D DCT [34] to express image features as a weighted sum of cosine functions at different frequencies. By leveraging the advantages of 2D DCT in capturing multi-frequency features [51], it addresses the information loss issue of traditional average pooling, thereby enhancing information integrity. The representation at the k-th frequency is denoted as $X_i^k$ and is calculated using the following formula:
$$X_i^k = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X_i^{:,h,w} \, D_{h,w}^{u_k, v_k}$$
where $X_i^{:,h,w}$ denotes the values extracted from all channels of $X_i$ at spatial position $(h, w)$, $D$ represents the $H \times W$ frequency spectrum generated by the 2D DCT, and $D_{h,w}^{u_k, v_k}$ represents the frequency component at index $(u_k, v_k)$, which corresponds to the DCT basis shown in Figure 4. In addition, the 2D DCT basis image under the top-K selection strategy [34] is defined as follows:
$$D_{h,w}^{u_k, v_k} = \cos\left(\frac{\pi h}{H}\left(u_k + \frac{1}{2}\right)\right) \cos\left(\frac{\pi w}{W}\left(v_k + \frac{1}{2}\right)\right)$$
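For illustration, the following small sketch implements the multi-frequency compression defined by the two equations above, projecting a feature map onto K selected frequency components. The frequency index pairs and helper names are ours, and the paper's top-K selection (K = 32) is not reproduced here:

```python
import math
import torch

def dct_basis(H: int, W: int, u: int, v: int) -> torch.Tensor:
    """2D DCT basis D^{u,v}_{h,w}, following the equation above."""
    h = torch.arange(H).float().view(H, 1)
    w = torch.arange(W).float().view(1, W)
    return (torch.cos(math.pi * h / H * (u + 0.5)) *
            torch.cos(math.pi * w / W * (v + 0.5)))           # (H, W)

def multi_frequency_pool(x: torch.Tensor, freqs) -> torch.Tensor:
    """Project a (B, C, H, W) feature map onto K frequency components.
    `freqs` is a list of (u_k, v_k) index pairs; the result is a (B, C, K)
    tensor of per-channel frequency responses (the (0, 0) component reduces
    to global average pooling up to a scale factor)."""
    B, C, H, W = x.shape
    basis = torch.stack([dct_basis(H, W, u, v) for u, v in freqs])  # (K, H, W)
    # X_i^k = sum_{h,w} X_i[:, h, w] * D^{u_k, v_k}_{h, w}
    return torch.einsum('bchw,khw->bck', x, basis)

# Example with K = 4 low-frequency components:
feats = multi_frequency_pool(torch.randn(2, 96, 64, 64),
                             freqs=[(0, 0), (0, 1), (1, 0), (1, 1)])
print(feats.shape)  # torch.Size([2, 96, 4])
```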
The hyperparameter K is experimentally configured as 32 in our implementation. Next, $X_i^k$ is compressed by applying global average pooling, global max pooling, and global min pooling, respectively. The pooled results are then passed through an Adaptive 1D Conv [33] to avoid dimensionality reduction and directly compute the weights. With the channel dimension C as a reference, the kernel size $\phi$ can be dynamically calculated by applying the following equation:
$$\phi = \left| \frac{\log_2(C)}{\alpha} + \frac{\beta}{\alpha} \right|_{odd}$$
where $|\lambda|_{odd}$ denotes the nearest odd number to $\lambda$. In the experiments conducted for this paper, $\alpha$ and $\beta$ are uniformly set to 2 and 1, respectively. Finally, the branches are summed to obtain the global channel attention map $G(X_i) \in \mathbb{R}^{C}$. Additionally, the proposed MFMS-AM follows the concept of the Multi-Scale Channel Attention Module (MS-CAM) [29], in which an additional point-wise convolution branch provides local feature channel attention at another scale. It leverages only point-wise interactions at each spatial location to enable channel interaction. The local feature channel attention map $L(X_i) \in \mathbb{R}^{C \times H \times W}$ is obtained via the following bottleneck architecture:
$$L(X_i) = \psi\left(\mathrm{pwConv}_2\left(\theta\left(\psi\left(\mathrm{pwConv}_1(X_i)\right)\right)\right)\right)$$
where $\mathrm{pwConv}_1(\cdot)$ and $\mathrm{pwConv}_2(\cdot)$ represent point-wise convolutions, $\theta(\cdot)$ denotes the rectified linear unit (ReLU), and $\psi(\cdot)$ denotes batch normalization (BN). It is worth noting that $L(X_i)$ has the same shape as the input features, which allows it to preserve and highlight subtle details in the low-level features. Given $G(X_i)$ and $L(X_i)$, MFMS-AM produces the fusion weight $X_i'$ through the Sigmoid activation function:
$$X_i' = \mathrm{Sigmoid}(G(X_i) + L(X_i))$$
The introduction of multi-frequency information improves the integrity of the information, and the Adaptive 1D Conv [33] enhances the precision of information transmission. This allows the MFMSBlock to more effectively utilize the feature information produced by the encoder and decoder. Furthermore, the point-wise convolution branch supplements complementary extra scale information, equipping the MFMSBlock with enhanced multi-scale object characterization capabilities. This design explicitly addresses scale variance in high-resolution RS imagery. Applying these strategies ensures that both low-level and high-level features are fully utilized and complement each other during the fusion process, resulting in refined feature fusion.
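The following sketch assembles the pieces of the MFMS-AM and the MFMSBlock fusion described above (it reuses the multi_frequency_pool helper from the earlier sketch). The choice to pool the frequency responses over the frequency axis, the bottleneck reduction ratio, and all module names are our assumptions rather than the authors' implementation:

```python
import math
import torch
import torch.nn as nn

class AdaptiveConv1d(nn.Module):
    """ECA-style weighting: a 1D convolution over the channel axis whose kernel
    size follows the equation above (alpha = 2, beta = 1), avoiding the
    dimensionality reduction of fully connected layers."""
    def __init__(self, channels: int, alpha: int = 2, beta: int = 1):
        super().__init__()
        k = int(abs((math.log2(channels) + beta) / alpha))
        k = k if k % 2 else k + 1                      # force an odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, v):                              # v: (B, C)
        return self.conv(v.unsqueeze(1)).squeeze(1)    # (B, C)

class MFMSAM(nn.Module):
    """Sketch of MFMS-AM: the global branch pools the K frequency responses
    (avg/max/min assumed over the frequency axis), weights each with an
    Adaptive 1D Conv, and sums them; the local branch is the point-wise
    bottleneck L(X_i); the fusion weight is Sigmoid(G + L)."""
    def __init__(self, channels: int, reduction: int = 4,
                 freqs=((0, 0), (0, 1), (1, 0), (1, 1))):
        super().__init__()
        self.freqs = list(freqs)                       # (u_k, v_k) pairs; the paper uses K = 32
        self.eca = nn.ModuleList([AdaptiveConv1d(channels) for _ in range(3)])
        hidden = channels // reduction
        self.local = nn.Sequential(                    # L(X_i) = BN(pwConv2(ReLU(BN(pwConv1(X_i)))))
            nn.Conv2d(channels, hidden, 1), nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1), nn.BatchNorm2d(channels))

    def forward(self, x):                              # x = F_i + F~_i, shape (B, C, H, W)
        fk = multi_frequency_pool(x, self.freqs)       # (B, C, K), from the earlier sketch
        pooled = [fk.mean(-1), fk.amax(-1), fk.amin(-1)]
        g = sum(conv(p) for conv, p in zip(self.eca, pooled))   # G(X_i): (B, C)
        l = self.local(x)                              # L(X_i): (B, C, H, W)
        return torch.sigmoid(g[..., None, None] + l)   # fusion weight X_i'

def mfms_fuse(f_low, f_high, attn: MFMSAM):
    """Z_i = w * F_i + (1 - w) * F~_i with w = MFMS-AM(F_i + F~_i)."""
    w = attn(f_low + f_high)
    return w * f_low + (1 - w) * f_high
```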

3.4. Remote Sensing Image Segmentation Based on CVMH-UNet

The preparation of the training set involves collecting original RS images and their corresponding labels. The data are then prepared by adjusting the image dimensions to meet the network's input requirements, applying data augmentation and normalization, and generating pixel-level labels that specify the class of each pixel. The processed data are fed into CVMH-UNet. First, feature extraction is conducted via the HVSSBlock-based encoder, and feature reconstruction is subsequently achieved by the decoder, which shares the same HVSSBlock architecture. The MFMSBlock is then employed to integrate the multi-level features produced by the encoder and decoder layers. The final projection layer adjusts the output channels to the number of target classes, generating per-pixel class-membership probabilities.
During training, the model is optimized by minimizing Cross-Entropy Loss and Dice Loss, which measure the discrepancy between predictions and ground truth labels. Backpropagation is then performed to adjust the model parameters, ensuring the model conforms more precisely to the training data. After setting a reasonable number of iterations, the weights of the model are preserved once the loss function stabilizes, and these weights are used for predictions.

4. Dataset and Experimental Setting

4.1. Datasets

4.1.1. ISPRS Vaihingen Dataset

The dataset consists of 16 high-resolution RS images with a spatial resolution of 9 cm; each image averages approximately 2500 × 2000 pixels and is composed of three channels (near-infrared, red, and green). The dataset includes the following five foreground classes: impervious surfaces, buildings, low vegetation, trees, and cars, along with one background class (clutter). The training set is constituted by the images with index numbers 1, 3, 7, 11, 13, 17, 23, 26, 28, 32, 34, and 37, and the test set consists of the images with index numbers 5, 15, 21, and 30.

4.1.2. ISPRS Potsdam Dataset

The dataset consists of 38 high-resolution RS images with a spatial resolution of 5 cm; each image averages approximately 6000 × 6000 pixels and is composed of three-channel RGB images (red, green, and blue bands). In alignment with the Vaihingen dataset, it includes the following six classes: impervious surfaces, buildings, low vegetation, trees, cars, and one background class (clutter). We use the images with the following IDs as the test set: 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15, and 7_13. The training set consists of the remaining 23 images, excluding image 7_10, as it contains errors in its annotations.

4.1.3. Gaofen Image Dataset with 15 Categories (GID-15)

The GID-15 dataset consists of 10 high-resolution satellite images with an average size of 7200 × 6800 pixels per image. These images are RGB images composed of red, green, and blue channels, with a spatial resolution of 0.8 m. The dataset covers 15 land-cover categories (16 classes when the background is included). In the experiments, the GID-15 images were cropped into 512 × 512 pixel patches, yielding 1820 patches in total, which were then randomly divided into training and testing sets at a ratio of 60% to 40%.
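As an illustrative sketch of this preparation step, the following example crops an image tensor into non-overlapping 512 × 512 patches and randomly splits the patch list 60%/40%; file handling and the treatment of border pixels are assumptions:

```python
import random
import torch

def crop_into_patches(image: torch.Tensor, patch: int = 512):
    """Split a (C, H, W) image into non-overlapping patch x patch tiles,
    dropping any incomplete border tiles."""
    C, H, W = image.shape
    return [image[:, r:r + patch, c:c + patch]
            for r in range(0, H - patch + 1, patch)
            for c in range(0, W - patch + 1, patch)]

def random_split(patches, train_ratio: float = 0.6, seed: int = 0):
    """Randomly divide the patch list into training and testing sets (60%/40%)."""
    rng = random.Random(seed)
    indices = list(range(len(patches)))
    rng.shuffle(indices)
    cut = int(len(indices) * train_ratio)
    train = [patches[i] for i in indices[:cut]]
    test = [patches[i] for i in indices[cut:]]
    return train, test
```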

4.2. Experimental Setting

The experiments were conducted on a single NVIDIA GeForce RTX 4080 GPU with 16 GB of memory. The AdamW optimizer (PyTorch 2.1.1) was employed with a base learning rate of 0.001, and the learning rate was scheduled using the CosineAnnealingLR policy. Training used a batch size of 5 and a maximum of 300 epochs. To ensure the rigor and fairness of the experiments, CVMH-UNet used the same experimental settings across all datasets.
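A minimal training-loop sketch with the stated settings follows (AdamW, base learning rate 0.001, CosineAnnealingLR, 300 epochs; the batch size of 5 is assumed to be set in the DataLoader). Stepping the scheduler once per epoch with T_max equal to the number of epochs is our assumption:

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model: nn.Module, train_loader, criterion, epochs: int = 300,
          base_lr: float = 1e-3, device: str = "cuda"):
    """Training-loop sketch: AdamW with base LR 0.001 and cosine annealing
    over 300 epochs; `model`, `train_loader`, and `criterion` are supplied
    by the caller (criterion = CE + Dice, see Section 4.4)."""
    model = model.to(device)
    optimizer = AdamW(model.parameters(), lr=base_lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)  # anneal over the full schedule
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```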
We conducted comprehensive benchmarking of CVMH-UNet against leading semantic segmentation methods. The comparison involved CNN- and Transformer-based models, including BANet [52], ABCNet [53], UNetFormer [22], A2-FPN [54], MAResU-Net [55], MANet [30], and CMTFNet [56], as well as the Vision Mamba-based Rs3Mamba [26] and CM-UNet [28]. It is important to note that data augmentation and enhancement can significantly impact model performance. To ensure fairness, all models used the same data processing pipeline, preventing performance discrepancies caused by differences in augmentation and enhancement techniques. In all experiments conducted in this paper, we did not employ data augmentation or complex data enhancement techniques; therefore, the reported results may differ from those in the original papers. Unless otherwise specified, the optimal results are presented in bold within the tables.

4.3. Evaluation Metrics

Three metrics are used to assess the performance of our model and compare it with recent state-of-the-art methods: the mean intersection over union (mIoU), the mean F1 score (mF1), and the mean accuracy (mAcc). They are computed from the aggregated confusion matrix as follows:
$$\mathrm{mIoU} = \frac{1}{N} \sum_{k=1}^{N} \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k + \mathrm{FN}_k}$$
$$\mathrm{mF1} = \frac{1}{N} \sum_{k=1}^{N} \frac{2 \times \mathrm{precision}_k \times \mathrm{recall}_k}{\mathrm{precision}_k + \mathrm{recall}_k}$$
$$\mathrm{mAcc} = \frac{1}{N} \sum_{k=1}^{N} \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FN}_k}$$
$$\mathrm{precision}_k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k}$$
$$\mathrm{recall}_k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FN}_k}$$
where $\mathrm{TP}_k$, $\mathrm{FP}_k$, $\mathrm{TN}_k$, and $\mathrm{FN}_k$ represent the true positives, false positives, true negatives, and false negatives for class k, respectively, and N denotes the number of classes. In addition, we evaluate model complexity in terms of floating-point operations (FLOPs) and the total number of parameters (Params). Higher values of mIoU, mF1, and mAcc indicate better model performance, while lower FLOPs and Params indicate a more lightweight model.
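The metrics above translate directly into code; the following sketch computes mIoU, mF1, and mAcc from an aggregated confusion matrix (per-class accuracy is taken as the recall-style ratio in the mAcc formula):

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """Compute mIoU, mF1, and mAcc from an N x N confusion matrix,
    where conf[i, j] counts pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp            # predicted as class k but actually not k
    fn = conf.sum(axis=1) - tp            # truly class k but missed
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    acc = tp / (tp + fn)                  # per-class accuracy as in the mAcc formula
    return iou.mean(), f1.mean(), acc.mean()

# Example with a 3-class confusion matrix:
m_iou, m_f1, m_acc = segmentation_metrics(np.array([[50, 2, 3],
                                                    [4, 40, 1],
                                                    [2, 3, 45]]))
print(round(m_iou, 4), round(m_f1, 4), round(m_acc, 4))
```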

4.4. Loss Functions

The loss function for our multi-class segmentation task comprises a hybrid formulation combining Cross-Entropy Loss and Dice Loss, formally defined as follows:
$$L_{total} = L_{Ce} + L_{Dice}$$
$$L_{Ce} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$
$$L_{Dice} = 1 - \frac{2|X \cap Y|}{|X| + |Y|}$$
where $L_{total}$ denotes the total loss of the model, formulated as the sum of the individual loss terms. N represents the total number of samples, and C represents the number of classes. $y_{i,c}$ is an indicator that takes the value 1 if sample i belongs to class c and 0 otherwise, and $\hat{y}_{i,c}$ denotes the model's predicted probability that sample i belongs to class c. $X$ and $Y$ denote the ground-truth labels and the predicted results, respectively, and $|X \cap Y|$ represents the size of their intersection.
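A sketch of the hybrid loss defined above for multi-class segmentation follows; the small smoothing constant and the per-class averaging of the Dice term are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CEDiceLoss(nn.Module):
    """Sketch of the hybrid loss L_total = L_Ce + L_Dice; the equal weighting
    follows the equations above, the smoothing term is an added assumption."""
    def __init__(self, smooth: float = 1e-6):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.smooth = smooth

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits: (B, C, H, W); target: (B, H, W) with integer class labels.
        ce = self.ce(logits, target)
        probs = F.softmax(logits, dim=1)
        one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
        intersection = (probs * one_hot).sum(dim=(0, 2, 3))
        cardinality = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
        dice = 1 - ((2 * intersection + self.smooth) / (cardinality + self.smooth)).mean()
        return ce + dice

# Example:
loss = CEDiceLoss()(torch.randn(2, 6, 64, 64), torch.randint(0, 6, (2, 64, 64)))
print(loss.item())
```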

5. Experimental Results and Analysis

5.1. Comparison with State-of-the-Art Methods on the ISPRS Vaihingen Dataset

Consistent with established methodologies in the field, the background class (clutter) is excluded from the accuracy metric calculations due to its marginal representation in the dataset. As demonstrated in Table 1, the proposed CVMH-UNet attains the best scores for the mIoU, mF1, and mAcc. Compared with the next best method, the mIoU improves by 0.92%, the mF1 by 0.60%, and the mAcc by 0.45%. Our approach obtains the top scores in every category. The model achieves a 90.25% IoU score on the large-scale building class, outperforming the closest competitor by 0.92%, and the small-scale car category achieves an IoU score of 68.44%, outperforming the second-best method by 0.53%. The experimental results demonstrate that CVMH-UNet achieves effective segmentation of multi-scale targets, fulfilling the critical requirements for high-resolution RS image segmentation in contemporary applications. A visual comparison of segmentation performance on the ISPRS Vaihingen dataset is provided in Figure 5, highlighting differences between our method and other approaches. Figure 5 shows that CVMH-UNet achieves superior preservation of complete contours for large-scale objects (e.g., buildings) while enabling the precise segmentation of individual vehicles within small-scale object clusters (e.g., car groups). In summary, CVMH-UNet demonstrates superior segmentation capabilities for multi-scale objects, outperforming the existing comparative methods.

5.2. Comparison with State-of-the-Art Methods on ISPRS Potsdam Dataset

To substantiate the effectiveness of our model, we conducted comparative experiments with other advanced methods on the larger ISPRS Potsdam dataset, with the results tabulated in Table 2. Consistent with the ISPRS Vaihingen dataset, the accuracy of the background class is not reported. As demonstrated in Table 2, the proposed method achieves the highest IoU scores in all evaluated categories except the building class, while achieving the best performance in both the mIoU and mF1 metrics. Compared with the next best method, the mIoU increases by 0.76% and the mF1 improves by 0.49%. Qualitative comparisons of the segmentation results between the proposed method and the benchmark methods on the ISPRS Potsdam dataset are provided in Figure 6, offering visual evidence of the performance disparities. The comparative analysis of Figure 6 demonstrates that CVMH-UNet preserves finer structural details in its segmentation outputs, exhibiting higher fidelity to the ground truth annotations. The proposed method also generates segmentation boundaries with markedly sharper delineation and significantly fewer erroneous pixel assignments than the benchmark approaches, demonstrating its competitive superiority in precision.

5.3. Comparison with State-of-the-Art Methods on the GID-15 Dataset

To comprehensively validate the proposed method’s effectiveness, experiments are performed on the GID-15 dataset against state-of-the-art methods. This dataset contains complex scene compositions with 15 distinct land-cover categories, providing rigorous testing conditions. As shown in Table 3, CVMH-UNet achieves the best performance in terms of mIoU, mF1, and mAcc, surpassing the suboptimal method by 1.49% in mIoU, 0.86% in mF1, and 1.03% in mAcc. Notably, among the 16 categories in this dataset (including the background), our method achieves the highest accuracy in more than half of the classes. The experimental results demonstrate that the proposed CVMH-UNet significantly improves segmentation accuracy in large-scale complex scenarios. Figure 7 illustrates the segmentation results of different methods on the GID-15 dataset. It can be observed that CVMH-UNet not only delivers higher accuracy in large-scale region segmentation but also exhibits finer precision in handling intricate details. This visually reinforces the superior performance of CVMH-UNet compared to other methods.

5.4. Ablation Experiments

5.4.1. Effect of Each Module of CVMH-UNet

A comprehensive component-wise ablation analysis is conducted on the CVMH-UNet architecture using the ISPRS Vaihingen dataset to evaluate individual module contributions. We deliberately select the smallest dataset among the three for the ablation experiments because it offers a more rigorous testing environment. The dataset’s limited size and inherent data sparsity create a challenging scenario that compels modules to demonstrate their inherent efficacy under constrained information conditions. We assess the effectiveness of the proposed HVSSBlock and MFMSBlock by removing and adding the respective modules. Table 4 summarizes the results of the ablation comparisons, where “✓” and “×” indicate the presence and absence of the corresponding modules, respectively. For HVSSBlock, we specifically examine the effectiveness of the residual structure and the local branch within the module. For MFMSBlock, we focus on the effectiveness of the introduced multi-frequency information and the Adaptive 1D Conv. To clarify, the HVSSBlock is optimized and upgraded based on the VSSBlock. The first row in the table represents our baseline method using a U-Net architecture based on the VSSBlock.
Baseline + HVSSBlock: The HVSSBlock, constructed with integrated convolutional branches and an optimized scanning method, achieves a higher feature representation capability. Table 4 shows that deploying the HVSSBlock on the ISPRS Vaihingen dataset results in a 0.51% improvement in the mIoU, while only slightly increasing the model's computational complexity. This indicates that our designed module can significantly enhance the feature extraction ability of the model while maintaining low computational complexity, thereby demonstrating the module's effectiveness.
Baseline + MFMSBlock: The MFMSBlock achieves effective skip connection fusion by constructing a multi-frequency, multi-scale feature fusion mechanism. As shown in Table 4, MFMSBlock leads to a 0.34% improvement in the mIoU on the ISPRS Vaihingen dataset at a very low computational cost, which solidly establishes the module’s lightweight architecture and its proven effectiveness.
Baseline + HVSSBlock+MFMSBlock (CVMH-UNet): Combining HVSSBlock and MFMSBlock constructs the complete CVMH-UNet. As shown in Table 4, the combination of the two modules yields the highest mIoU score. Compared to the unmodified Baseline, CVMH-UNet significantly improves segmentation accuracy at the cost of only a slight increase in computational complexity. This demonstrates that the proposed HVSSBlock and MFMSBlock not only fulfill the needs of high-resolution RS image segmentation but also satisfy the demands of practical applications.

5.4.2. Effect of HVSSBlock

A component-wise ablation study is designed to validate the following three core improvements: (1) the hybrid convolutional branch enhancing VMamba's local feature extraction capability; (2) the optimized CS2D scanning method enabling the more comprehensive capture of global information; and (3) residual skip connections optimizing feature propagation while minimizing information degradation. First, we compare the performance in global feature extraction of the original SS2D scanning method and our optimized CS2D scanning method. Table 5 quantifies the contribution of the CS2D scanning method, where a 0.22% mIoU improvement is achieved at equivalent computational cost, validating its effectiveness in global feature extraction. Additionally, the other two main designs of the HVSSBlock are the integration of the convolutional branch and the use of the residual structure. As quantified in Table 6, the integration of the convolutional branch yields a 0.12% mIoU improvement, demonstrating its capability to effectively address VMamba's limitations in local feature extraction. The integration of the residual structure improves the mIoU by 0.14% while maintaining computational efficiency, demonstrating its capability to enhance feature extraction and reduce information loss during processing. The combined implementation of the convolutional branch and the residual structure yields a 0.29% improvement in mIoU. Considering both accuracy and computational complexity, we conclude that the combination of the two is the optimal choice.

5.4.3. Effect of the MFMSBlock

Ablation studies systematically validate the following two hypotheses: (1) the integration of multi-frequency information significantly enhances feature representation integrity, and (2) the use of the Adaptive 1D Conv enhances information transmission precision. As shown in Table 7, the introduction of multi-frequency information further improves the model's segmentation accuracy, which fully demonstrates that incorporating multi-frequency information enhances information integrity and better captures complex information. Additionally, we compare the use of fully connected (FC) layers with the Adaptive 1D Conv for calculating the channel weights of the multi-frequency-aggregated features. The results show that the Adaptive 1D Conv improves the mIoU by 0.26% compared with fully connected layers, verifying that it effectively reduces the information loss caused by dimensionality reduction in fully connected layers and ensures precise information transmission.

5.5. Model Complexities

We evaluated the computational complexity of all models using random tensors sized 3 × 256 × 256. Table 8 presents a comparison of computational complexity and parameter scale across all methods, all evaluated under identical experimental configurations. For comparative purposes, Table 8 also incorporates the mIoU scores of each model across the ISPRS Vaihingen, ISPRS Potsdam, and GID-15 datasets. As shown in Table 8, while CVMH-UNet does not achieve the lowest computational complexity, it notably features a complete VMamba architecture throughout both its encoder and decoder when compared to UNetFormer (the most computationally efficient model). UNetFormer incorporates Transformer architectures in only three decoder layers while utilizing CNN architectures in all other components. Although this hybrid design enables a lightweight framework, it demonstrates significantly inferior performance compared to CVMH-UNet. CVMH-UNet demonstrates significant computational efficiency advantages when compared with both Transformer-based networks (e.g., CMTFNet) and multi-head self-attention architectures (e.g., MANet). In particular, when compared to other VMamba-based networks such as Rs3Mamba and CM-UNet, CVMH-UNet achieves the highest segmentation accuracy. In summary, the proposed CVMH-UNet effectively reduces computational costs while maintaining high segmentation accuracy, achieving an optimal balance between precision and efficiency.

6. Conclusions

This paper proposes the CVMH-UNet for high-resolution RS image semantic segmentation, with the following two key modules: (1) The HVSSBlock, serving as the fundamental unit for encoder–decoder architecture, achieves the comprehensive exploration of global and local features by integrating convolutional branches and an optimized scanning method. This approach significantly enhances model accuracy and efficiency compared to conventional VSSBlocks. (2) The MFMSBlock replaces traditional skip connections by innovatively fusing multi-frequency domain information with Adaptive 1D Conv, enhancing information integrity and transmission precision to enable the refined fusion of encoder–decoder features. Experimental results on three benchmark high-resolution RS datasets demonstrate that CVMH-UNet achieves the best performance. The proposed model effectively processes large-scale imagery while maintaining an optimal balance between precision and computational complexity. Furthermore, extensive ablation studies validate the effectiveness of the proposed modules. Future work will focus on further structural optimization and exploring Vision Mamba’s potential in RS applications.

Author Contributions

Y.C., C.L., L.Z. and Z.W. Conceptualization; Y.C. and C.L. methodology; Y.C. and C.L. software; Y.C. and C.L. validation; L.Z., L.Y. and Z.W. formal analysis; Y.C. and C.L. investigation; Y.C. and C.L. resources; Y.C. and C.L. data curation; Y.C. and C.L. writing—original draft preparation; Z.W. writing—review and editing; C.L. visualization; Y.C. supervision; Z.W., L.Z. and L.Y. project administration; Y.C., Z.W., L.Z. and L.Y. funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported, in part, by the National Natural Science Foundation of China under Grant 62401007, 62471006, 62201007, 62001003, and Grant U23B2007; the Natural Science Foundation of Anhui Province under Grant 2308085QF199; the Open Project Funds for the Key Laboratory of Space Photoelectric Detection and Perception (Nanjing University of Aeronautics and Astronautics), Ministry of Industry and Information Technology under Grant NJ2024027-4; and the Fundamental Research Funds for the Central Universities under Grant NJ2024027.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Acknowledgments

The authors would like to thank the anonymous reviewers for their comments and constructive suggestions for improving the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, Y.; Fan, B.; Wang, L.; Bai, J.; Xiang, S.; Pan, C. Semantic labeling in very high resolution images via a self-cascaded convolutional neural network. ISPRS J. Photogramm. Remote Sens. 2018, 122, 78–95. [Google Scholar] [CrossRef]
  2. Yao, H.; Qin, R.; Chen, X. Unmanned Aerial Vehicle for Remote Sensing Applications—A Review. Remote Sens. 2019, 11, 1443. [Google Scholar] [CrossRef]
  3. Yan, L.; Fan, B.; Liu, H.; Huo, C.I.; Xiang, S.; Pan, C. Triplet Adversarial Domain Adaptation for Pixel-Level Classification of VHR Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 3558–3573. [Google Scholar] [CrossRef]
  4. Li, R.; Zheng, S.; Duan, C.; Wang, L.; Zhang, C. Land cover classification from remote sensing images based on multi-scale fully convolutional network. Geo-Spat. Inf. Sci. 2022, 25, 278–294. [Google Scholar] [CrossRef]
  5. Liu, H.; Li, W.; Xia, X.; Zhang, M.; Gao, C.; Tao, R. Central Attention Network for Hyperspectral Imagery Classification. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 8989–9003. [Google Scholar] [CrossRef] [PubMed]
  6. Zhang, M.; Li, W.; Zhang, Y.; Tao, R.; Du, Q. Hyperspectral and LiDAR Data Classification Based on Structural Optimization Transmission. IEEE Trans. Cybern. 2023, 53, 3153–3164. [Google Scholar] [CrossRef] [PubMed]
  7. Shi, Z.; Fan, J.; Du, Y.; Zhou, Y.; Zhang, Y. LULC-SegNet: Enhancing Land Use and Land Cover Semantic Segmentation with Denoising Diffusion Feature Fusion. Remote Sens. 2024, 16, 4573. [Google Scholar] [CrossRef]
  8. Zhao, J.; Du, D.; Chen, L.; Liang, X.; Chen, H.; Jin, Y. HA-Net for Bare Soil Extraction Using Optical Remote Sensing Images. Remote Sens. 2024, 16, 3088. [Google Scholar] [CrossRef]
  9. Zhang, Y.; Li, W.; Sun, W.; Tao, R.; Du, Q. Single-Source Domain Expansion Network for Cross-Scene Hyperspectral Image Classification. IEEE Trans. Image Process. 2023, 32, 1498–1512. [Google Scholar] [CrossRef]
  10. Lian, R.; Wang, W.; Mustafa, N.; Huang, L. Road Extraction Methods in High-Resolution Remote Sensing Images: A Comprehensive Review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 5489–5507. [Google Scholar] [CrossRef]
  11. Zhao, X.; Wu, Z.; Chen, Y.; Zhou, W.; Wei, M. Fine-Grained High-Resolution Remote Sensing Image Change Detection by SAM-UNet Change Detection Model. Remote Sens. 2024, 16, 3620. [Google Scholar] [CrossRef]
  12. Song, J.; Yang, S.; Li, Y.; Li, X. An Unsupervised Remote Sensing Image Change Detection Method Based on RVMamba and Posterior Probability Space Change Vector. Remote Sens. 2024, 16, 4656. [Google Scholar] [CrossRef]
  13. Guo, Y.; Jia, X.; Paull, D. Effective Sequential Classifier Training for SVM-Based Multitemporal Remote Sensing Image Classification. IEEE Trans. Image Process. 2018, 27, 3036–3048. [Google Scholar] [CrossRef]
  14. Pal, M. Random forest classifier for remote sensing classification. Int. J. Remote Sens. 2005, 26, 217–222. [Google Scholar] [CrossRef]
  15. Krähenbühl, P.; Koltun, V. Efficient inference in fully connected CRFs with Gaussian edge potentials. Proc. Adv. Neural Inf. Process. Syst. (NeurIPS) 2011, 9, 109–117. [Google Scholar]
  16. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  17. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. Available online: https://api.semanticscholar.org/CorpusID:225039882 (accessed on 1 March 2025).
  18. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6877–6886. [Google Scholar] [CrossRef]
  19. Wang, W.; Chen, W.; Qiu, Q.; Chen, L.; Wu, B.; Lin, B.; He, X.; Liu, W. Crossformer++: A versatile vision transformer hinging on cross-scale attention. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 3123–3136. [Google Scholar] [CrossRef] [PubMed]
  20. He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  21. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 205–218. [Google Scholar] [CrossRef]
  22. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  23. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2024, arXiv:2312.00752. [Google Scholar]
  24. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar]
  25. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
  26. Ma, X.; Zhang, X.; Pun, M.-O. RS3Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  27. Ding, H.; Xia, B.; Liu, W.; Zhang, Z.; Zhang, J.; Wang, X.; Xu, S. A Novel Mamba Architecture with a Semantic Transformer for Efficient Real-Time Remote Sensing Semantic Segmentation. Remote Sens. 2024, 16, 2620. [Google Scholar] [CrossRef]
  28. Liu, M.; Dan, J.; Lu, Z.; Yu, Y.; Li, Y.; Li, X. CM-UNet: Hybrid CNN-Mamba UNet for Remote Sensing Image Semantic Segmentation. arXiv 2024, arXiv:2405.10530. [Google Scholar]
  29. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 3559–3568. [Google Scholar] [CrossRef]
  30. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P. Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  31. Li, J.; Cheng, S. AFENet: An Attention-Focused Feature Enhancement Network for the Efficient Semantic Segmentation of Remote Sensing Images. Remote Sens. 2024, 16, 4392. [Google Scholar] [CrossRef]
  32. Du, B.; Shan, L.; Shao, X.; Zhang, D.; Wang, X.; Wu, J. Transform Dual-Branch Attention Net: Efficient Semantic Segmentation of Ultra-High-Resolution Remote Sensing Images. Remote Sens. 2025, 17, 540. [Google Scholar] [CrossRef]
  33. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar] [CrossRef]
  34. Qin, Z.; Zhang, P.; Wu, F.; Li, X. Fcanet: Frequency channel attention networks. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 763–772. [Google Scholar] [CrossRef]
  35. Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. Localmamba: Visual state space model with windowed selective scan. arXiv 2024, arXiv:2403.09338. [Google Scholar]
  36. Li, Y.; Luo, Y.; Zhang, L.; Wang, Z.; Du, B. MambaHSI: Spatial-Spectral Mamba for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar] [CrossRef]
  37. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  38. Woo, S.; Park, J.; Lee, J.; Kweon, I. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. Available online: https://openaccess.thecvf.com/content_ECCV_2018/html/Sanghyun_Woo_Convolutional_Block_Attention_ECCV_2018_paper.html (accessed on 1 March 2025).
  39. Roy, A.; Navab, N.; Wachinger, C. Recalibrating fully convolutional networks with spatial and channel “squeeze and excitation” blocks. IEEE Trans. Med. Imaging 2019, 38, 540–549. [Google Scholar] [CrossRef] [PubMed]
  40. Lee, H.; Kim, H.; Nam, H. Srm: A style-based recalibration module for convolutional neural networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1854–1862. [Google Scholar] [CrossRef]
  41. Zhou, N.; Hong, J.; Cui, W.; Wu, S.; Zhang, Z. A Multiscale Attention Segment Network-Based Semantic Segmentation Model for Landslide Remote Sensing Images. Remote Sens. 2024, 16, 1712. [Google Scholar] [CrossRef]
  42. Li, X.; Xu, F.; Li, L.; Xu, N.; Liu, F.; Yuan, C.; Chen, Z.; Lyu, X. AAFormer: Attention-Attended Transformer for Semantic Segmentation of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  43. Wang, S.; Hu, Q.; Wang, S.; Zhao, P.; Li, J.; Ai, M. Category attention guided network for semantic segmentation of Fine-Resolution remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2024, 127, 103661. [Google Scholar] [CrossRef]
  44. Liu, Y.; Bai, X.; Wang, J.; Li, G.; Li, J.; Lv, Z. Image semantic segmentation approach based on DeepLabV3 plus network with an attention mechanism. Eng. Appl. Artif. Intell. 2024, 127, 107260. [Google Scholar] [CrossRef]
  45. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar] [CrossRef]
  46. Liu, B.; Li, B.; Sreeram, V.; Li, S. MBT-UNet: Multi-Branch Transform Combined with UNet for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2024, 16, 2776. [Google Scholar] [CrossRef]
  47. Hu, Z.; Qian, Y.; Xiao, Z.; Yang, G.; Jiang, H.; Sun, X. SABNet: Self-Attention Bilateral Network for Land Cover Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8559–8569. [Google Scholar] [CrossRef]
  48. Liu, J.; Hua, W.; Zhang, W.; Liu, F.; Xiao, L. Stair Fusion Network With Context-Refined Attention for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17. [Google Scholar] [CrossRef]
  49. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  50. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar] [CrossRef]
  51. Sang, M.; Hansen, J.H.L. Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 321–325. [Google Scholar] [CrossRef]
  52. Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer meets convolution: A bilateral awareness network for semantic segmentation of very fine resolution urban scene images. Remote Sens. 2021, 13, 3065. [Google Scholar] [CrossRef]
  53. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive bilateral contextual network for efficient semantic segmentation of Fine-Resolution remotely sensed imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98. [Google Scholar] [CrossRef]
  54. Li, R.; Wang, L.; Zhang, C.; Duan, C.; Zheng, S. A2-FPN for semantic segmentation of fine-resolution remotely sensed images. Int. J. Remote Sens. 2022, 43, 1131–1155. [Google Scholar] [CrossRef]
  55. Li, R.; Zheng, S.; Duan, C.; Su, J.; Zhang, C. Multistage attention ResU-Net for semantic segmentation of fine-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  56. Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and multiscale transformer fusion network for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12. [Google Scholar] [CrossRef]
Figure 1. Comparison of two different scanning methods. (a) Four scanning paths of SS2D. (b) Four scanning paths of CS2D.
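For readers less familiar with selective-scan models, the following minimal sketch (in PyTorch, using the hypothetical helper name `four_direction_scans`) shows how a 2D feature map can be unrolled into four 1D sequences along different scanning directions before being processed by a Mamba-style state-space block. It only illustrates the general multi-directional scanning idea compared in Figure 1; the actual path layouts of SS2D and CS2D may differ from this row-/column-major example.

```python
import torch

def four_direction_scans(x: torch.Tensor) -> torch.Tensor:
    """Unroll a feature map (B, C, H, W) into four 1D scan sequences.

    Directions used here: row-major, column-major, and their reverses.
    This is an illustrative stand-in for multi-directional 2D scanning,
    not the paper's exact CS2D path layout.
    """
    row_major = x.flatten(2)                    # rows left-to-right, top-to-bottom
    col_major = x.transpose(2, 3).flatten(2)    # columns top-to-bottom, left-to-right
    row_rev = torch.flip(row_major, dims=[-1])  # reversed row-major path
    col_rev = torch.flip(col_major, dims=[-1])  # reversed column-major path
    return torch.stack([row_major, col_major, row_rev, col_rev], dim=1)  # (B, 4, C, H*W)

if __name__ == "__main__":
    seqs = four_direction_scans(torch.randn(2, 8, 16, 16))
    print(seqs.shape)  # torch.Size([2, 4, 8, 256])
```

Each of the four sequences would then be passed through its own selective-scan branch, and the branch outputs merged back into the 2D spatial layout.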
Figure 2. (a) Overall network structure of the CVMH-UNet. (b) Illustration of the HVSSBlock. (c) Overall architecture of the MFMSBlock. (d) Detailed structure of the HVSSLayer.
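As a structural orientation for Figure 2a, the sketch below shows where a skip-connection fusion block sits in a generic UNet-style encoder-decoder. The class name `SkipFusionUNet`, the stand-in 1×1 fusion convolution, the stage count, and the channel widths are assumptions for illustration only; they do not reproduce the internals of the HVSSBlock or MFMSBlock.

```python
import torch
import torch.nn as nn

class SkipFusionUNet(nn.Module):
    """Schematic UNet-style encoder-decoder with fused skip connections.

    A plain 1x1 convolution stands in for a refined fusion module at each
    skip connection; channel widths are illustrative assumptions.
    """
    def __init__(self, in_ch=3, num_classes=6, widths=(32, 64, 128, 256)):
        super().__init__()
        self.enc = nn.ModuleList()
        prev = in_ch
        for w in widths:                               # stride-2 encoder stages
            self.enc.append(nn.Sequential(nn.Conv2d(prev, w, 3, stride=2, padding=1),
                                          nn.BatchNorm2d(w), nn.ReLU(inplace=True)))
            prev = w
        self.fuse = nn.ModuleList([nn.Conv2d(2 * w, w, 1) for w in widths[:-1]])
        self.dec = nn.ModuleList([nn.ConvTranspose2d(widths[i + 1], widths[i], 2, stride=2)
                                  for i in range(len(widths) - 1)])
        self.head = nn.Sequential(nn.ConvTranspose2d(widths[0], widths[0], 2, stride=2),
                                  nn.Conv2d(widths[0], num_classes, 1))

    def forward(self, x):
        skips = []
        for stage in self.enc:
            x = stage(x)
            skips.append(x)
        for i in reversed(range(len(self.dec))):
            x = self.dec[i](x)                                      # upsample decoder feature
            x = self.fuse[i](torch.cat([x, skips[i]], dim=1))       # refine the skip fusion
        return self.head(x)

if __name__ == "__main__":
    out = SkipFusionUNet()(torch.randn(1, 3, 256, 256))
    print(out.shape)  # torch.Size([1, 6, 256, 256])
```

The plain fusion convolution here only marks the position at which a refined fusion module such as the MFMSBlock would operate.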
Figure 3. (a) Detailed structure of the CS2D. (b) Detailed structure of the CrossMamba. (c) Detailed structure of the E-FNN.
Figure 4. Detailed structure of the MFMS-AM.
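Multi-frequency channel information obtained via the 2D DCT is commonly realized in the style of the frequency channel attention of [34]. The snippet below is a hypothetical, minimal sketch of that idea, assuming a small fixed set of low-frequency DCT components and a single fully connected gating layer; it is not the paper's MFMS-AM implementation.

```python
import math
import torch
import torch.nn as nn

def dct_filter(h, w, u, v):
    """Build one 2D DCT basis function of size (h, w) for frequency (u, v)."""
    ys = torch.arange(h, dtype=torch.float32)
    xs = torch.arange(w, dtype=torch.float32)
    cos_y = torch.cos(math.pi * (ys + 0.5) * u / h)
    cos_x = torch.cos(math.pi * (xs + 0.5) * v / w)
    return cos_y[:, None] * cos_x[None, :]

class MultiFrequencyChannelAttention(nn.Module):
    """FcaNet-style [34] channel attention using several 2D DCT components
    instead of plain global average pooling (the (0, 0) component alone
    reduces to GAP). A sketch of the multi-frequency idea, not the MFMS-AM."""
    def __init__(self, channels, h, w, freqs=((0, 0), (0, 1), (1, 0), (1, 1))):
        super().__init__()
        basis = torch.stack([dct_filter(h, w, u, v) for u, v in freqs])  # (F, H, W)
        self.register_buffer("basis", basis)
        self.fc = nn.Sequential(nn.Linear(channels * len(freqs), channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        spec = torch.einsum("bchw,fhw->bcf", x, self.basis)   # per-channel DCT responses
        weights = self.fc(spec.reshape(b, -1)).view(b, c, 1, 1)
        return x * weights                                    # reweight channels

if __name__ == "__main__":
    attn = MultiFrequencyChannelAttention(channels=64, h=32, w=32)
    print(attn(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```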
Figure 5. Segmentation results of different methods on the ISPRS Vaihingen dataset.
Figure 6. Segmentation results of different methods on the ISPRS Potsdam dataset.
Figure 7. Segmentation results of different methods on the GID-15 dataset.
Table 1. Quantitative comparison on the ISPRS Vaihingen dataset. The accuracy for each class is presented in the form of IoU (%). The optimal results are presented in bold.
Method | Imp. surf. | Building | Low veg. | Tree | Car | mIoU (%) | mF1 (%) | mAcc (%)
BANet [52] | 71.57 | 80.84 | 52.02 | 70.66 | 45.54 | 64.13 | 77.33 | 77.47
ABCNet [53] | 77.52 | 88.09 | 56.94 | 75.23 | 61.46 | 71.85 | 83.11 | 84.80
UNetFormer [22] | 76.75 | 87.18 | 57.20 | 74.49 | 57.54 | 70.63 | 82.33 | 83.74
A2-FPN [54] | 78.56 | 87.12 | 59.12 | 75.12 | 62.81 | 72.55 | 83.67 | 84.41
MAResU-Net [55] | 80.27 | 89.33 | 59.87 | 75.50 | 65.28 | 74.05 | 84.67 | 85.19
MANet [30] | 79.28 | 88.10 | 59.43 | 76.14 | 67.62 | 74.11 | 84.76 | 85.37
CMTFNet [56] | 81.17 | 89.31 | 61.00 | 75.86 | 67.91 | 75.05 | 85.38 | 85.04
Rs3Mamba [26] | 79.78 | 88.18 | 58.70 | 75.65 | 63.58 | 73.18 | 84.06 | 84.30
CM-UNet [28] | 79.77 | 89.10 | 59.43 | 75.62 | 66.56 | 74.10 | 84.72 | 84.55
CVMH-UNet (ours) | 81.92 | 90.25 | 62.11 | 77.13 | 68.44 | 75.97 | 85.98 | 85.82
Table 2. Quantitative comparison on the ISPRS Potsdam dataset. The accuracy for each class is presented in the form of IoU (%). The optimal results are presented in bold.
Method | Imp. surf. | Building | Low veg. | Tree | Car | mIoU (%) | mF1 (%) | mAcc (%)
BANet [52] | 73.80 | 80.25 | 63.48 | 58.86 | 74.23 | 70.12 | 82.19 | 82.56
ABCNet [53] | 81.94 | 90.11 | 71.89 | 73.64 | 82.75 | 80.07 | 88.78 | 88.61
UNetFormer [22] | 82.00 | 89.41 | 71.08 | 71.18 | 82.41 | 79.22 | 88.23 | 88.16
A2-FPN [54] | 82.54 | 90.55 | 71.78 | 72.76 | 82.82 | 80.09 | 88.77 | 88.69
MAResU-Net [55] | 82.07 | 90.73 | 71.76 | 72.36 | 83.87 | 80.16 | 88.81 | 88.75
MANet [30] | 82.36 | 90.95 | 71.66 | 72.59 | 83.34 | 80.23 | 88.85 | 89.03
CMTFNet [56] | 82.49 | 90.48 | 71.81 | 73.23 | 83.07 | 80.22 | 88.86 | 88.65
Rs3Mamba [26] | 82.17 | 89.83 | 71.28 | 72.49 | 82.76 | 79.71 | 88.54 | 87.99
CM-UNet [28] | 82.37 | 90.66 | 71.45 | 72.94 | 83.19 | 80.12 | 88.79 | 88.48
CVMH-UNet (ours) | 83.40 | 90.70 | 72.91 | 73.97 | 83.99 | 80.99 | 89.35 | 88.96
Table 3. Quantitative comparison on the GID-15 dataset. The accuracy for each class is presented in the form of IoU (%). The optimal results are presented in bold.
Method | Bac. * | Ind. * | Urb. * | Rur. * | Tra. * | Pad. * | Irr. * | Dry. * | Gar. * | Arb. * | Shr. * | Nat. * | Art. * | River | Lake | Pond | mIoU (%) | mF1 (%) | mAcc (%)
BANet [52] | 60.59 | 50.86 | 61.43 | 51.78 | 43.07 | 59.08 | 73.71 | 54.71 | 25.20 | 75.29 | 13.78 | 55.11 | 45.80 | 86.66 | 79.56 | 70.04 | 56.67 | 70.34 | 70.14
ABCNet [53] | 65.58 | 56.56 | 66.37 | 53.00 | 55.63 | 66.05 | 76.86 | 58.21 | 32.47 | 75.02 | 1.96 | 58.15 | 27.22 | 79.03 | 82.22 | 73.67 | 58.00 | 70.65 | 68.14
UNetFormer [22] | 65.25 | 58.09 | 65.09 | 52.85 | 56.91 | 66.21 | 78.23 | 60.76 | 31.85 | 76.31 | 4.87 | 59.74 | 50.29 | 88.03 | 74.18 | 72.31 | 60.06 | 72.79 | 75.04
A2-FPN [54] | 63.77 | 55.64 | 65.72 | 53.14 | 54.82 | 65.07 | 77.46 | 59.37 | 33.43 | 75.93 | 2.92 | 58.39 | 32.20 | 90.83 | 79.98 | 74.01 | 58.92 | 71.45 | 71.47
MAResU-Net [55] | 65.16 | 52.41 | 65.45 | 53.80 | 55.94 | 64.95 | 76.52 | 60.70 | 28.31 | 76.19 | 13.50 | 61.42 | 47.14 | 90.96 | 82.18 | 72.98 | 60.48 | 73.37 | 75.24
MANet [30] | 65.95 | 53.72 | 65.69 | 52.10 | 57.54 | 67.65 | 77.92 | 57.64 | 26.29 | 76.34 | 6.09 | 62.23 | 52.07 | 92.48 | 80.13 | 75.64 | 60.59 | 72.98 | 74.57
CMTFNet [56] | 65.26 | 53.40 | 63.89 | 52.63 | 54.81 | 64.37 | 78.36 | 62.23 | 39.45 | 76.30 | 9.00 | 62.07 | 41.26 | 90.16 | 74.12 | 71.81 | 59.95 | 72.97 | 73.54
Rs3Mamba [26] | 64.85 | 54.63 | 65.81 | 53.19 | 54.61 | 66.86 | 77.37 | 64.15 | 41.40 | 75.00 | 12.27 | 59.75 | 45.73 | 88.95 | 79.85 | 73.44 | 61.09 | 74.09 | 75.82
CM-UNet [28] | 64.16 | 55.65 | 64.45 | 51.46 | 51.77 | 63.65 | 77.12 | 59.69 | 29.34 | 76.35 | 10.69 | 59.95 | 55.66 | 89.96 | 78.77 | 73.35 | 60.13 | 73.07 | 75.37
CVMH-UNet (ours) | 66.98 | 57.52 | 65.84 | 53.00 | 59.09 | 68.76 | 79.92 | 65.65 | 33.18 | 76.74 | 11.95 | 62.04 | 48.41 | 93.33 | 82.97 | 75.84 | 62.58 | 74.95 | 76.85
* The names of all categories have been abbreviated, and the specific classes can be found in Figure 7.
Table 4. Results of the ablation study comparison. The optimal results are presented in bold.
VSSBlock * | HVSSBlock | MFMSBlock | mIoU (%) | FLOPs (G) | Params (M)
✓ | × | × | 75.08 | 4.09 | 22.04
× | ✓ | × | 75.59 | 5.61 | 30.44
✓ | × | ✓ | 75.42 | 4.18 | 22.43
× | ✓ | ✓ | 75.97 | 5.71 | 30.84
* VSSBlock serves as the baseline for the feature extraction module HVSSBlock.
Table 5. Comparison of different scanning strategies. The optimal results are presented in bold.
SS2D | CS2D | mIoU (%) | FLOPs (G) | Params (M)
✓ | × | 75.08 | 4.09 | 22.04
× | ✓ | 75.30 | 4.09 | 22.04
Table 6. Ablation study results for each module of the HVSSBlock. The optimal results are presented in bold.
Local Branch | Residual Block | mIoU (%) | FLOPs (G) | Params (M)
× | × | 75.30 | 4.09 | 22.04
✓ | × | 75.42 | 5.61 | 30.44
× | ✓ | 75.44 | 4.09 | 22.04
✓ | ✓ | 75.59 | 5.61 | 30.44
Table 7. Ablation results for each module of the MFMSBlock. The optimal results are presented in bold.
MS * | MF * | Adaptive 1D Conv | FC * | mIoU (%) | FLOPs (G) | Params (M)
✓ | × | × | ✓ | 75.63 | 5.69 | 31.24
✓ | ✓ | × | ✓ | 75.71 | 5.71 | 30.94
✓ | ✓ | ✓ | × | 75.97 | 5.71 | 30.84
* MS represents multi-scale, MF represents multi-frequency, and FC represents fully connected.
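For context on the "Adaptive 1D Conv" versus "FC" rows of Table 7: in ECA-style channel attention [33], a 1D convolution whose kernel size is derived adaptively from the channel count replaces the fully connected layer of SE-style modules [37]. The sketch below follows the published ECA-Net formulation and is an assumption with respect to this paper's exact module; the class name `AdaptiveConv1dChannelAttention` is hypothetical.

```python
import math
import torch
import torch.nn as nn

class AdaptiveConv1dChannelAttention(nn.Module):
    """ECA-style [33] channel attention: an adaptively sized 1D convolution
    over the pooled channel descriptor replaces the fully connected layer."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                      # force an odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        y = x.mean(dim=(2, 3))                         # (B, C) global average pooling
        y = self.conv(y.unsqueeze(1)).squeeze(1)       # local cross-channel interaction
        return x * torch.sigmoid(y)[:, :, None, None]  # channel reweighting

if __name__ == "__main__":
    att = AdaptiveConv1dChannelAttention(64)
    print(att(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```

Compared with the fully connected alternative, this keeps the parameter count essentially independent of the channel width, which is consistent with the slightly lower parameter total in the last row of Table 7.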
Table 8. Comparison of computational complexity between different methods. The optimal results are presented in bold.
Method | mIoU (%) * | FLOPs (G) | Params (M)
BANet [52] | 64.13 / 70.12 / 56.67 | 3.26 | 12.72
ABCNet [53] | 71.85 / 80.17 / 58.00 | 3.91 | 13.39
UNetFormer [22] | 70.63 / 79.22 / 60.06 | 2.94 | 11.68
A2-FPN [54] | 72.55 / 80.09 / 58.92 | 10.46 | 22.82
MAResU-Net [55] | 74.05 / 80.16 / 60.48 | 8.78 | 23.27
MANet [30] | 74.11 / 80.23 / 60.59 | 19.45 | 35.86
CMTFNet [56] | 75.05 / 80.22 / 59.95 | 8.57 | 30.07
Rs3Mamba [26] | 73.18 / 79.71 / 61.09 | 15.82 | 49.66
CM-UNet [28] | 74.10 / 80.12 / 60.13 | 3.17 | 13.55
CVMH-UNet (ours) | 75.97 / 80.99 / 62.58 | 5.71 | 30.84
* The three values in the mIoU column are the model’s results on the ISPRS Vaihingen, ISPRS Potsdam, and GID-15 datasets, respectively.
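The FLOPs and parameter counts reported above can be reproduced approximately with a standard profiler. The paper does not state which tool it used, so the snippet below, which relies on the third-party `thop` package and a toy stand-in network, is only an assumed measurement recipe (the 1 × 3 × 512 × 512 input size is likewise an assumption).

```python
import torch
import torch.nn as nn

def count_params_m(model: nn.Module) -> float:
    """Trainable parameters in millions (the 'Params (M)' column)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

if __name__ == "__main__":
    # Toy stand-in network; replace with the segmentation model under test.
    model = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 6, 1),
    )
    dummy = torch.randn(1, 3, 512, 512)
    print(f"Params: {count_params_m(model):.2f} M")
    try:
        from thop import profile  # pip install thop
        macs, _ = profile(model, inputs=(dummy,), verbose=False)
        print(f"FLOPs: {macs / 1e9:.2f} G")  # MAC counts are commonly reported as FLOPs
    except ImportError:
        print("Install `thop` to estimate FLOPs.")
```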
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
