Article

Link Aggregation for Skip Connection–Mamba: Remote Sensing Image Segmentation Network Based on Link Aggregation Mamba

1 School of Information Science and Technology, Northwest University, Xi’an 710127, China
2 School of Arts and Communication, Beijing Normal University, Beijing 100875, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(19), 3622; https://doi.org/10.3390/rs16193622
Submission received: 12 June 2024 / Revised: 5 September 2024 / Accepted: 13 September 2024 / Published: 28 September 2024
(This article belongs to the Special Issue Deep Learning for Satellite Image Segmentation)

Abstract

The semantic segmentation of satellite and UAV remote sensing imagery is pivotal for resource exploration, change detection, quantitative analysis and urban planning. Recent advancements have seen an influx of segmentation networks utilizing convolutional neural networks and transformers. However, the intricate geographical features and varied land cover boundary interferences in remote sensing imagery still challenge conventional segmentation networks’ spatial representation and long-range dependency capabilities. This paper introduces a novel U-Net-like network for UAV image segmentation. We developed a link aggregation Mamba at the critical skip connection stage of UNetFormer. This approach maps and aggregates multi-scale features from different stages into a unified linear dimension through four Mamba branches containing state-space models (SSMs), ultimately decoupling and fusing these features to restore the contextual relationships in the mask. Moreover, the Mix-Mamba module is incorporated, leveraging a parallel self-attention mechanism with SSMs to combine the advantages of a global receptive field with reduced modeling complexity. This module facilitates nonlinear modeling across different channels and spaces through multipath activation, capturing both global and local long-range dependencies. Evaluations on public remote sensing datasets such as LoveDA, UAVid and Vaihingen underscore the state-of-the-art performance of our approach.

1. Introduction

With the continuous advancements in sensor technology, urban remote sensing imagery captured by satellites and unmanned aerial vehicles (UAVs) now exhibits higher resolution and greater detail. The classification of geographic features in remote sensing images, namely pixel-level semantic segmentation, has extensively influenced various urban planning applications such as change detection [1,2,3], environmental monitoring [4,5,6,7], cartography [8,9] and traffic surveillance [10,11]. Recent studies have predominantly focused on deep learning frameworks based on convolutional neural networks [12,13,14,15,16,17] and transformers [18,19,20,21,22,23,24], employing extensive datasets for supervised learning to generalize applications to real-world scenarios.
Compared to traditional image segmentation algorithms such as region growing [25,26], fuzzy set theory [27] and conditional random fields [28,29], deep-learning-based methods excel in capturing the global and local context of input images through progressively expanding receptive fields. They utilize data-driven approaches to delve deeply into expert prior knowledge, forming an end-to-end segmentation paradigm. The primary frameworks are divided into CNNs and attention mechanisms. CNN-based segmentation networks continually stack two-dimensional convolutional kernels to build fully convolutional feature reconstruction networks that map images to masks, exemplified by architectures like U-Net [15] and PSPNet [16]. Attention mechanisms capture the long-range dependencies of images globally or locally to reconstitute the relationships between spatial and channel contexts, thus focusing on semantic responses, as seen in models like Segmenter [30]. Recent studies indicate that segmentation networks combining CNNs with self-attention mechanisms leverage the strengths of both, serializing the representational outcomes of CNNs and using transformers to construct spatial relationships between tokens for more precise semantic localization and boundary delineation, with ViT [31] and UnetFormer [32] being the most representative models.
Remote sensing images, due to their complex and variable land cover scenes, multispectral channels and high spatiotemporal resolution, pose significant challenges to the representational capacity of segmentation models. Pure CNN-based approaches can capture fine-grained local context but are constrained by their limited receptive fields, making it difficult to learn complex representations. This limitation prevents them from fully capturing cross-regional semantic information in irregular large-scale land scenes. Although some methods employ global pooling modules to supplement missing panoramic class information, these modules can interfere with the recognition of fine-grained local details. Transformer models, through self-attention mechanisms, compute sequential feature responses and construct long-range dependencies, enabling the accurate mapping of cross-class geographical positions. However, as the input sequence length or network depth increases, the fully connected multi-head self-attention mechanism incurs higher computational complexity, greatly limiting both model efficiency and depth. This constraint becomes especially problematic when processing high-resolution remote sensing images, as computational resources and efficiency are strained. Recent studies have combined CNNs and transformers in hybrid architectures [33,34,35] to mitigate the limitations of individual modules, but regardless of whether these modules are arranged sequentially or interleaved, the computational cost remains high, and the generalization and robustness on small-scale datasets or sparse features are limited. On the other hand, cross-scale connections, which fuse low-level spatial details from the encoder with high-level semantic information from the decoder, enable better handling of objects with complex boundaries, facilitating both localization and detail refinement. As a result, they are widely used in remote sensing and medical segmentation. However, current connections simply concatenate features at matching scales, which introduces low-dimensional textures into high-dimensional features, placing additional pressure on the decoder’s representational capacity.
The state-space model (SSM) establishes long-range dependencies through state transitions and utilizes convolutional computations to perform these transitions, achieving near-linear computational complexity. Mamba [36] enhances efficiency during training and inference by incorporating time-varying parameters and constructing standard SSM modules. Vim [37] and VMamba [38] extend Mamba technology from processing one-dimensional sequential data to two-dimensional visual domains, where they have been proven to outperform transformer architectures in performance and various downstream tasks. Jamba [39] has innovatively restructured the hybrid expert architecture by interweaving transformers with Mamba layers, enabling flexible parameter management tailored to various large-scale foundational language tasks. HC-Mamba [40] integrates depthwise separable convolution and dilated convolution into Mamba, creating the hc-ssm module to capture a broader range of contextual information, thereby processing large-scale medical image data at a lower computational cost. MxT [41] enhances pixel-level interactive learning through the complementary synergy of Mamba and transformers, improving the model’s performance in high-quality image reconstruction.
This paper proposes a U-Net-like network for remote sensing image segmentation. The overall framework integrates two efficient state-space feature processing models, LASC-Mamba and Mix-Mamba, within an architecture based on a ResNet18 encoder and a symmetric decoder with multi-head attention mechanisms. Specifically, LASC-Mamba serializes the output from the ResNet18 feature layers, applies positional encoding and forms global modeling through shared parameters. It adaptively facilitates cross-resolution interactions of land cover features via cascading, thereby avoiding the limitations of the direct concatenation of single-scale features. The Mix-Mamba module reconstructs semantics by combining local convolutional features, global semantic multi-head attention and Mamba branches with the corresponding scale’s encoder outputs. This enhances the decoder’s ability to integrate multi-dimensional information from both the LASC-Mamba output and the previous feature outputs, capturing information invariant to diverse land covers. The two Mamba modules are inserted into the network’s fusion and decoding layers, achieving more complex mask mapping through cross-scale and multi-path representations.
Our principal contributions are as follows:
  • We propose an SSM-based link aggregation method, LASC-Mamba, to facilitate semantic exchange between the encoder and decoder. This approach enhances the cross-scale representation capability during the skip-layer stages in remote sensing segmentation networks.
  • During the decoding phase, a dual-pathway Mix-Mamba is employed to activate insensitive information across different spatial and sequential dimensions, thereby augmenting the capability for high-level image understanding.
  • Comparative and ablation experiments are conducted on public remote sensing image segmentation datasets (LoveDA, UAVid and Vaihingen). The results demonstrate that our method outperforms traditional CNN- and transformer-based segmentation approaches.

2. Related Work

In recent years, the advancement of semantic segmentation in remote sensing imagery has benefited from iterative updates to transformer architectures and U-Net-like networks. Recent studies have also introduced the Mamba architecture to the field of computer vision.

2.1. U-Net-like Network

The U-Net network, introduced by [15] with its symmetrical encoder–decoder architecture, extracts multilevel features by progressively downsampling the spatial resolution during the encoding phase and gradually restoring it through skip connections in the decoding phase; it has been widely used for high-precision semantic segmentation. Several works, such as Res-UNet [42], Dense-UNet [43], UNet++ [44] and UNet3+ [45], have improved U-Net performance. MFU-Net [46] integrates a new multi-scale feature extraction module that captures shallow information across different scales. TreeUNet [47] utilizes a confusion matrix to construct tree–CNN blocks that merge multi-scale features and adaptively enhance pixel-level classification accuracy. ResUNet-a [42] expands the U-Net baseline with a rich mix of optimization strategies, including residual connections, pyramid scene parsing pooling and multitask inference. However, U-Net based on convolutional neural networks is limited by the convolutional kernel’s receptive field and lacks the capability for holistic image modeling. Many researchers have therefore turned to transformers. Chen et al. [20] introduced TransUNet, which replaces CNNs with a multilayer transformer at the encoding layer but retains a CNN architecture at the decoding layer. Cao et al. [19] proposed Swin-Unet, a purely transformer-based U-Net-like network. Wang et al. [32] introduced UNetFormer, which employs a transformer-based decoder with an efficient global–local attention mechanism to mine image information. Although transformers enhance the capacity to handle long-range dependencies, they possess quadratic complexity, which is costly for image segmentation tasks that typically feature high resolution.

2.2. Hybrid Architecture of CNN and Transformer

To address the global semantic deficiencies of CNNs and reduce the computational cost of transformers, recent research has relied on hybrid architectures combining CNNs with transformers. For instance, DETR [48] utilizes CNN-extracted low-resolution feature maps, reshaping them into feature sequences and applying positional encoding. This allows the transformer to linearly represent the output sequence, reducing the input size to the transformer while enhancing model learning speed and overall performance. However, compared to region-based feature extraction models, DETR, based on self-attention mechanisms, is less sensitive to small object details. In contrast, ViT-FRCNN [49] serially integrates faster R-CNN [50] with ViT, reconstructing the object detection network by reconfiguring ViT’s linear output into a two-dimensional feature input for the convolutional residual layers in the detection phase, enabling image object detection with transformers. MobileViT [51] and MobileViTV2 [35] achieve an alternating integration of CNNs and transformers through the design of lightweight MobileViT modules and inverted residual blocks, leveraging inductive bias to share sequential and spatial positional information of image features. STT [34] constructs a dual-path transformer architecture that uses a sparse token sampler to learn long-range dependencies in both spatial and channel dimensions in parallel, improving building extraction accuracy while reducing time costs. However, the effectiveness of this model on remote sensing datasets is constrained by the number of sparse token samplers and requires extensive experimentation to accumulate tuning knowledge. These approaches rely on various mixed modes of convolutional and self-attention mechanisms but still do not address the transformer’s insensitivity to local information in high-dimensional semantic decoding stages. Additionally, compared to pure convolutional models, these hybrid models exhibit limited generalization performance on small-scale datasets and sparse information.

2.3. Mamba Structure

The state-space model (SSM) is a foundational scientific model in modern control theory for linear time-invariant systems. It maps a one-dimensional function or sequence $x(t) \in \mathbb{R}$ to an output $y(t) \in \mathbb{R}$ and can be represented by the following linear ordinary differential equation (ODE):

$h'(t) = A h(t) + B x(t)$ (1)

$y(t) = C h(t)$ (2)

where the output $y(t) \in \mathbb{R}$ is obtained from the input signal $x(t) \in \mathbb{R}$ and the implicit latent state $h(t) \in \mathbb{R}^{N}$. $A \in \mathbb{R}^{N \times N}$ denotes the state transition matrix, while $B \in \mathbb{R}^{N \times 1}$ and $C \in \mathbb{R}^{1 \times N}$ are the projection matrices that parameterize the system. However, this continuous formulation has prohibitive computational and memory requirements compared to equivalent CNNs, which limits its application to deep learning tasks. Structured state-space sequence models (S4) [52] utilize a high-order polynomial projection operator (HiPPO) [53] to impose a structured form on the state transition matrix $A$, building a deep sequence model with efficient long-range reasoning capabilities. The recent Mamba [36] employs an input-dependent selection mechanism to filter out non-valuable information from the input and develops a hardware-aware algorithm to increase speed. In addition, Mamba demonstrates state-of-the-art performance and significant computational efficiency by merging SSM blocks with linear layers.
Overall, Mamba and S4 implement the continuous system described in Equations (1) and (2) in a discrete form and integrate it into deep learning approaches. A and B are discretized by a zeroth-order hold (ZOH) with the time-scale parameter $\Delta$. The process is shown below:
$\bar{A} = \exp(\Delta A)$ (3)

$\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B$ (4)
After discretization, the state-space model (SSM) can be further computed in two ways: a linear recurrence relation (Equations (5) and (6)) and a global convolution operation (Equations (7) and (8)).
$h_t = \bar{A} h_{t-1} + \bar{B} x_t$ (5)

$y_t = C h_t$ (6)

$\bar{K} = (C\bar{B},\, C\bar{A}\bar{B},\, \ldots,\, C\bar{A}^{L-1}\bar{B})$ (7)

$y = x * \bar{K}$ (8)

where $\bar{K} \in \mathbb{R}^{L}$ denotes a structured convolutional kernel and $L$ is the length of the input sequence.
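To make the discretization and the two equivalent computation modes concrete, the following PyTorch sketch builds a toy single-input, single-output SSM, applies the ZOH discretization of Equations (3) and (4) and verifies that the recurrence of Equations (5) and (6) matches the convolutional form of Equations (7) and (8). The diagonal toy parameters and shapes are illustrative assumptions, not the parameterization used inside Mamba.

```python
import torch

N, L = 16, 64                               # state size N, sequence length L
A = torch.diag(-torch.rand(N) - 0.5)        # stable diagonal continuous-time state matrix (toy values)
B = torch.rand(N, 1)
C = torch.rand(1, N)
delta = 0.1                                 # time-scale parameter Δ

# Zeroth-order hold (ZOH) discretization, Equations (3) and (4)
A_bar = torch.matrix_exp(delta * A)
B_bar = torch.linalg.solve(delta * A, A_bar - torch.eye(N)) @ (delta * B)

x = torch.randn(L)                          # one-dimensional input sequence x_t

# (a) Linear recurrence, Equations (5) and (6)
h = torch.zeros(N, 1)
y_rec = []
for t in range(L):
    h = A_bar @ h + B_bar * x[t]            # h_t = A_bar h_{t-1} + B_bar x_t
    y_rec.append((C @ h).squeeze())         # y_t = C h_t
y_rec = torch.stack(y_rec)

# (b) Global convolution, Equations (7) and (8)
K_bar = torch.stack([(C @ torch.matrix_power(A_bar, k) @ B_bar).squeeze() for k in range(L)])
y_conv = torch.stack([sum(K_bar[k] * x[t - k] for k in range(t + 1)) for t in range(L)])

print(torch.allclose(y_rec, y_conv, atol=1e-5))   # the two computation modes agree
```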
Recent research has applied Mamba to various downstream tasks in remote sensing imagery. Xiao [54] designed a frequency-assisted Mamba framework (FMSR) that integrates a frequency selection module, a visual state-space module and a hybrid gate module. This framework complements the global and local dependencies of low-resolution features in super-resolution tasks while reducing computational cost and complexity. CDMamba [55] introduced the scaled residual ConvMamba (SRCM) to address Mamba’s shortcomings in remote sensing change detection, such as the lack of detail and challenges in fine detection, dynamically enabling dual-temporal feature guidance. Zhu [56] replaced the multi-head self-attention mechanism in ViT with the Mamba-based Samba module as an encoder structure, combining it with UperNet’s decoder for efficient multi-level semantic information extraction in remote sensing imagery.

3. Method

In this section, we present the design of a U-net-like network for remote sensing image segmentation, incorporating two Mamba modules. Section 3.1 outlines the overall architecture of the proposed network, while Section 3.2 and Section 3.3 provide detailed descriptions of the implementation processes and underlying principles of LASC-Mamba and Mix-Mamba, respectively.

3.1. Network Architectures

The overall structure of the remote sensing image segmentation network developed in this paper is depicted in Figure 1. In the encoder phase, we first utilize the ResNet18 network as the feature extraction layer, constructing convolutional outputs at varying scales through four downsampling stages. Subsequently, in the skip connection stage of the UNetFormer architecture, the outputs of the four intermediate layers at different scales are linearly reconstructed using the Mamba architecture, and the reconstructed results are aggregated and linked to the corresponding fusion layers of the decoder. Finally, in the decoding phase, a Mamba module is configured in parallel with each global–local transformer module, merging the two structures to share dependencies.
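As a concrete illustration of the encoder stage, the sketch below extracts the four multi-scale ResNet18 feature maps that feed the LASC-Mamba skip connections; the use of timm's features_only interface is an assumption for illustration, and any ResNet18 implementation exposing the four stage outputs would serve equally well.

```python
import torch
import timm

# ResNet18 backbone exposing the outputs of its four residual stages
encoder = timm.create_model("resnet18", features_only=True,
                            out_indices=(1, 2, 3, 4), pretrained=False)
x = torch.randn(1, 3, 512, 512)
feats = encoder(x)
for f in feats:
    print(tuple(f.shape))
# (1, 64, 128, 128)   1/4 resolution
# (1, 128, 64, 64)    1/8 resolution
# (1, 256, 32, 32)    1/16 resolution
# (1, 512, 16, 16)    1/32 resolution
```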

3.2. Link Aggregation for Skip Connection Mamba (LASC-Mamba)

Mamba has achieved unprecedented results in the serialization of discrete data sequences. Given that an image represents two-dimensional discrete data sampled from a continuous signal, the experience with transformers suggests that two-dimensional input data can be transformed into one-dimensional sequences using encoded embeddings. Therefore, we capitalize on Mamba’s linear scaling advantage to enhance UnetFormer’s ability to perform contextual reconstruction during skip connection processes. The LASC-Mamba skip connection structure, constructed between the CNN encoder and decoder, is illustrated in Figure 2. The architecture comprises multi-stage Mamba state-space encoding, link aggregation and feature restoration layers.
In the structure discussed in this paper, Mamba extends two-dimensional image features into one-dimensional sequential features while employing a dual-pathway linear encoding structure to capture long-range dependencies. As shown in the first branch of Figure 2, given the output features $I \in \mathbb{R}^{B \times C \times H \times W}$ from a convolutional layer, we directly flatten them into a vector $I \in \mathbb{R}^{B \times L \times C}$, where $B$ denotes the batch size, $L$ represents the feature length, $C$ indicates the channel count and $L = H \times W$. After layer normalization and processing through linear layers in two parallel branches, the feature dimensions expand to $(B, 2L, C)$. In the first branch, the output of the linear layer undergoes processing through a one-dimensional convolutional layer, a SiLU activation function [57] and an SSM layer, while the second branch utilizes only the SiLU activation function. Subsequently, the features from both branches are combined via the Hadamard product and dimensionally reduced in the final linear layer. Ultimately, the features are reshaped to the input dimensions $(B, C, H, W)$. The Mamba module constructed in this section can be described as follows:
$\mathrm{Mamba}(I) = L\left(\mathrm{SSM}\left(\sigma\left(\mathrm{Conv}\left(L(I)\right)\right)\right) \circ \sigma\left(L(I)\right)\right)$ (9)

where $\sigma$ denotes the SiLU activation function, $\mathrm{SSM}$ represents the state-space model and $\circ$ denotes the Hadamard product between the two branches. $L$ denotes a linear layer and $\mathrm{Conv}$ denotes the one-dimensional convolutional layer.
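A minimal PyTorch sketch of the Mamba branch in Equation (9) is given below. The SSM is abstracted behind a callable `ssm` (for example, a selective-scan module), and the depthwise one-dimensional convolution and projection sizes are illustrative assumptions rather than the exact configuration of our implementation.

```python
import torch
import torch.nn as nn

class MambaBranch(nn.Module):
    def __init__(self, dim, ssm):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.in_proj_x = nn.Linear(dim, dim)      # L(I) for the SSM path
        self.in_proj_z = nn.Linear(dim, dim)      # L(I) for the gating path
        self.conv1d = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.act = nn.SiLU()
        self.ssm = ssm                            # state-space model applied over the sequence
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, feat):                      # feat: (B, C, H, W)
        B, C, H, W = feat.shape
        seq = feat.flatten(2).transpose(1, 2)     # (B, L, C), L = H * W
        seq = self.norm(seq)
        x = self.in_proj_x(seq).transpose(1, 2)   # (B, C, L) for Conv1d
        x = self.ssm(self.act(self.conv1d(x)).transpose(1, 2))   # SSM(σ(Conv(L(I))))
        z = self.act(self.in_proj_z(seq))         # σ(L(I))
        out = self.out_proj(x * z)                # Hadamard product followed by the final linear layer
        return out.transpose(1, 2).reshape(B, C, H, W)
```

As a shape check, MambaBranch(64, ssm=nn.Identity()) maps a (2, 64, 32, 32) tensor to a tensor of the same size.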
In UNetFormer, the link aggregation structure aggregates the Mamba output features at four scales, corresponding to multilevel resolutions, allowing each stage’s path output to share global information from the other links. The approach maintains the same B and L while altering the channel count for stacking. To ensure that the final aggregated dimensions are multiples of the original stage dimensions, outputs at one-eighth size are duplicated. The final dimensions of the link aggregation layer are (B, L/16, C × 32). Channel reduction is then performed through convolution at each stage to (B, L, C), followed by a linear transformation that restores the features to two-dimensional form (B, C, H, W). To mitigate potential noise disturbances and incomplete information from a single Mamba architecture, the original skip connections in UNetFormer are preserved, and the link aggregation output for each stage is represented as follows:
$O_i = I_i + \delta\left(\mathrm{Conv}\left(\big\Vert_{i=1}^{4} \mathrm{Mamba}\left(\mathrm{LN}(I_i)\right)\right)\right)$ (10)

where $O_i$ denotes the skip connection output at the i-th stage, $I_i$ denotes the feature input at the i-th stage, $\Vert_{i=1}^{4}$ denotes link aggregation of the outputs from the first to the fourth stage, $\delta$ denotes the linear scale transformation and $\mathrm{Conv}$ refers to a one-dimensional convolution with a kernel size of 1 and a stride of 1.
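The following PyTorch sketch illustrates the residual form of Equation (10): the Mamba-encoded skip features of the four stages are aligned to a common spatial size, concatenated along the channel dimension, reduced by a per-stage 1 × 1 convolution and added back to the original skip feature. The bilinear interpolation used for alignment is a simplifying assumption that stands in for the duplication/stacking scheme described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinkAggregation(nn.Module):
    def __init__(self, channels=(64, 128, 256, 512), mamba_branches=None):
        super().__init__()
        # one Mamba branch per stage; identity placeholders keep the sketch runnable
        self.mamba_branches = mamba_branches or nn.ModuleList([nn.Identity() for _ in channels])
        total = sum(channels)
        # per-stage reduction conv: aggregated channels -> that stage's channel count
        self.reduce = nn.ModuleList([nn.Conv2d(total, c, kernel_size=1) for c in channels])

    def forward(self, feats):                        # feats: list of (B, C_i, H_i, W_i)
        encoded = [m(f) for m, f in zip(self.mamba_branches, feats)]
        outs = []
        for i, f in enumerate(feats):
            target = f.shape[-2:]
            # align every encoded stage to the current stage's spatial size, then concatenate
            agg = torch.cat([F.interpolate(e, size=target, mode="bilinear",
                                           align_corners=False) for e in encoded], dim=1)
            outs.append(f + self.reduce[i](agg))     # O_i = I_i + δ(Conv(aggregation))
        return outs
```

Applied to the four ResNet18 feature maps from the encoder sketch in Section 3.1, this module returns four tensors with unchanged shapes, ready to be consumed by the decoder stages.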

3.3. Mix-Mamba

Our mix module adds a parallel Mamba branch to the global–local transformer self-attention mechanism. It integrates global window semantic information, local two-dimensional feature information and state-space sequence information. This design compensates for the limited receptive fields and suboptimal modeling capabilities of the window-based self-attention. The overall structure of Mix-Mamba is shown in Figure 3.
In the architectural details, the local attention module is executed by two parallel convolutional layers with kernel sizes of 1 and 3; larger kernel sizes were avoided because they would impede computational efficiency. Batch normalization is then applied to each branch, and the outputs are summed. The structure can be represented as follows:
$LB(I) = \beta\left(\mathrm{Conv}_1(I)\right) + \beta\left(\mathrm{Conv}_3(I)\right)$ (11)

where $\beta$ denotes the batch normalization operation, while $\mathrm{Conv}_1$ and $\mathrm{Conv}_3$ represent two-dimensional convolutions with kernel sizes of 1 and 3, respectively.
On the other hand, the global attention mechanism module constructs the consistency of the overall remote sensing scene through a window-based multi-head self-attention mechanism. Contextual information from two-dimensional windows is computed and fused separately via a cross-window context interaction module. The structure is as follows:
$GB(I) = \mathrm{avepool}_1\left(\delta\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V\right) + \mathrm{avepool}_2\left(\delta\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V\right)$ (12)

where $\mathrm{avepool}_1$ and $\mathrm{avepool}_2$ denote average pooling across two different dimensions, while $\delta$ denotes the $\mathrm{softmax}$ function.
The Mamba module used in the Mamba branch is consistent with the previous section, featuring excellent spatial localization capabilities. Through the SSM, this architecture selectively compresses two-dimensional features to address content awareness of the input information. This parallel and hierarchical design enables the segmentation network’s decoder to capture global and local contexts across multiple scales while maintaining high efficiency. Consequently, the output of Mix-Mamba can be represented as follows:
$\mathrm{Mix}(I) = LB(I) + GB(I) + \mathrm{Mamba}(I)$ (13)
The Mix-Mamba module introduced in this section incorporates a multi-head self-attention mechanism, convolutional local feature extraction and Mamba global state-space mapping. Combining these three components’ advantages, it autonomously shares the necessary sparse features when addressing complex and variable geographical environments. These complementary sparse features restore essential contextual information, producing refined geographical mapping results.
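A simplified PyTorch sketch of this three-branch design, corresponding to Equations (11)–(13), is given below. Standard multi-head self-attention over flattened tokens stands in for the window-based global attention branch, and the Mamba branch is left as an injectable module (for example, the MambaBranch sketch from Section 3.2); both substitutions and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixMamba(nn.Module):
    def __init__(self, dim, num_heads=8, mamba_branch=None):
        super().__init__()
        # local branch LB: parallel 1x1 and 3x3 convolutions, each followed by batch norm
        self.conv1 = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.BatchNorm2d(dim))
        self.conv3 = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim))
        # global branch GB: plain multi-head self-attention standing in for window attention
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Mamba branch; identity placeholder keeps the sketch runnable
        self.mamba = mamba_branch or nn.Identity()

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        lb = self.conv1(x) + self.conv3(x)           # Eq. (11)
        tokens = x.flatten(2).transpose(1, 2)        # (B, H*W, C)
        gb, _ = self.attn(tokens, tokens, tokens)    # Eq. (12), simplified
        gb = gb.transpose(1, 2).reshape(B, C, H, W)
        return lb + gb + self.mamba(x)               # Eq. (13): sum of the three branches
```

For a quick check, MixMamba(64) applied to a (2, 64, 32, 32) tensor returns a tensor of the same shape.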

4. Experiments

We conducted comparative and ablation experiments on the LoveDA, UAVid and Vaihingen datasets to validate the effectiveness of LASC-Mamba and Mix-Mamba. We first describe the characteristics and training details of these public datasets. We then compare our model with advanced semantic segmentation algorithms from recent years to evaluate its performance. Finally, ablation studies are performed on the two Mamba modules to substantiate our theoretical claims.

4.1. Dataset

UAVid: The UAVid dataset [58] is a fine-resolution unmanned aerial vehicle semantic segmentation dataset for urban street scenes. The dataset comprises 42 sequences, totaling 420 individual images, across eight semantic categories. The images are divided into three subsets: a training set with 200 images, a validation set with 70 images and an official test set with 150 images for benchmarking purposes. Following [32], we preprocess each image from the dataset by padding it and cropping it into eight patches with 1024 × 1024 pixels.
LoveDA: The LoveDA dataset [59] contains high-quality optical remote sensing imagery (GSD 0.3 m) of three cities, divided into rural and urban scenarios covering seven land cover classes: background, building, road, water, barren, forest and agricultural. The dataset consists of 5987 images with a resolution of 1024 × 1024. The training set comprises 2522 images, the validation set comprises 1669 images and the officially provided test set contains 1796 images.
Vaihingen: The Vaihingen dataset consists of 33 TOP image tiles with a very fine spatial resolution, mainly depicting urban scenes. The images contain three bands (NIR, R, G) with an approximate spatial resolution of 8 cm. Following UNetFormer, we selected the designated 16 images for training and the remaining 17 for testing.
We processed the UAVid dataset into 1024 × 1024 input images and applied data augmentation during training with random vertical flips, horizontal flips and brightness adjustments. During testing, test-time augmentation strategies such as vertical and horizontal flips were employed.
For the LoveDA and Vaihingen datasets, images were randomly cropped to 512 × 512 patches. Training involved random scaling with ratios of 0.5, 0.75, 1.0, 1.25 and 1.5, along with data augmentation techniques such as random vertical flips, horizontal flips and rotations. In the testing phase, multi-scale and random flip augmentations were utilized.
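A possible implementation of these augmentation pipelines is sketched below using albumentations; the specific library, probability values and the use of 90-degree rotations are assumptions for illustration rather than the exact configuration of our training code.

```python
import albumentations as A

# UAVid: flips and brightness adjustment on 1024 x 1024 inputs
uavid_train_aug = A.Compose([
    A.VerticalFlip(p=0.5),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.0, p=0.5),
])

# LoveDA / Vaihingen: random scaling in [0.5, 1.5], 512 x 512 crops, flips and rotations
loveda_train_aug = A.Compose([
    A.RandomScale(scale_limit=(-0.5, 0.5), p=1.0),
    A.PadIfNeeded(min_height=512, min_width=512),
    A.RandomCrop(height=512, width=512),
    A.VerticalFlip(p=0.5),
    A.HorizontalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
])

# usage: augmented = loveda_train_aug(image=image, mask=mask)
```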

4.2. Training Detail

We implemented the LASC-Mamba and Mix-Mamba modules on the UNetFormer baseline using the PyTorch framework. All experiments were conducted on a single NVIDIA RTX 4090 GPU. The proposed method used the AdamW optimizer [60] with a weight decay of 0.01, a batch size of 8 and 100 epochs. The learning rate was adjusted using the CosineAnnealingLR strategy to prevent overfitting during training. The initial learning rate for the pre-trained ResNet18 backbone was set to $6 \times 10^{-5}$, while the other modules started at $6 \times 10^{-4}$ so that the newly added modules could adapt more quickly to the pre-trained backbone. The learning rate was gradually reduced to its minimum at the final epoch, ensuring smooth parameter updates and improved model generalization.
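The optimizer and schedule described above can be set up as follows; the parameter-group split by module name, the stand-in model and the minimum learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

# stand-in model with an `encoder` submodule, just to make the snippet executable;
# in practice this would be the full segmentation network
model = nn.ModuleDict({"encoder": nn.Conv2d(3, 64, 3, padding=1),
                       "decoder": nn.Conv2d(64, 6, 1)})

backbone_params = [p for n, p in model.named_parameters() if n.startswith("encoder")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("encoder")]

optimizer = torch.optim.AdamW(
    [{"params": backbone_params, "lr": 6e-5},    # pre-trained ResNet18 backbone
     {"params": other_params, "lr": 6e-4}],      # newly added modules
    weight_decay=0.01)

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

for epoch in range(100):
    # ... one training epoch with batch size 8 would go here ...
    scheduler.step()
```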

4.3. Comparison with Other Methods

To validate the efficacy of the proposed method, we conducted both qualitative comparisons and quantitative evaluations. We compared the approach described in this paper with various methods and generated visualizations for analysis. The comparative methods can be broadly categorized into three groups: CNN-based semantic segmentation networks such as BANet [61], ABCNet [62] and MANet [63]; transformer-based networks like TransUNet [20], Segmenter [30], CMTFNet [64], DC-Swin [21] and UnetFormer [32]; and Mamba-based networks such as CM-UNet [65] and RS3Mamba [66].
BANet [61]: Building on the foundation of visual attention mechanisms, BANet incorporates both forward and backward attention modules. By constructing two feature extraction pathways and employing a feature aggregation module, BANet effectively integrates local and global information, thereby resolving ambiguities.
ABCNet [62]: ABCNet constructs an attention-based dual-path lightweight network with spatial and contextual pathways, thereby reducing the high computational cost associated with fine-grained spatial details and large receptive fields.
MANet [63]: MANet employs an inter- and intra-region refinement (IIRR) method to reduce feature redundancy caused by fusion while constructing a multi-scale collaborative learning (MCL) framework to enhance the diversity of multi-scale feature representations. This approach leverages the correlations between different scales to effectively learn multi-scale features.
TransUNet [20]: TransUNet enhances the U-Net model by introducing a hybrid encoder that combines CNNs and transformers. This integration addresses the limitations of traditional convolutional neural networks in modeling long-range dependencies and processing large-scale images, thereby improving the handling of long-range dependencies, capturing semantic information in images and enhancing the model’s representational capacity and generalization performance.
Segmenter [30]: This network builds on the latest vision transformer (ViT) research, segments images into patches and maps them to a sequence of linear embeddings, which are then encoded by the encoder. The mask transformer decodes the outputs from the encoder and class embeddings, and, after upsampling, applies Argmax to classify each pixel, producing the final pixel segmentation map.
CMTFNet [64]: CMTFNet constructs a transformer decoder based on a multi-scale multi-head self-attention module to extract rich multi-scale global context and channel information. On one hand, the transformer block introduces an efficient feed-forward network (E-FFN) to enhance information interaction across different feature channels. On the other hand, the multi-scale attention fusion (MAF) module effectively integrates feature information from different levels.
DC-Swin [21]: DC-Swin incorporates the Swin-transformer as its encoder and features a densely connected convolutional decoder designed for high-resolution remote sensing image segmentation.
UNetFormer [32]: UNetFormer employs the lightweight ResNet18 as its encoder and has developed an effective global–local attention mechanism to model both global and local information within the decoder.
CM-UNet [65]: CM-UNet integrates the CSMamba module, which is based on spatial and channel attention activation, with a multi-scale attention aggregation module. This effectively captures long-range dependencies and multi-scale global context information in large-scale remote sensing images.
RS3Mamba [66]: RS3Mamba leverages the visual state-space (VSS) model to construct an auxiliary branch and introduces a collaborative completion module (CCM) to enhance and fuse features from dual encoders.
Quantitative Results: The quantitative results of the comparative experiments constructed in this paper are presented in Table 1, Table 2 and Table 3. On the LoveDA remote sensing image dataset, the method achieved the best results in five categories for the IoU metric and surpassed other CNN- and transformer-based methods in terms of mean IoU, indicating superior average performance. Analysis of the categories where the method excelled reveals that LASC-Mamba and Mix-Mamba provide more accurate semantic information in extensive scene contexts, particularly for the forest and agriculture categories. This suggests that LASC-Mamba introduces local texture information while incorporating lower-resolution contextual details. However, in the Building category, the IoU performance did not significantly outperform UNetFormer. The main reason for poor performance in the Barren category is the class imbalance in the LoveDA dataset, leading to unstable network training, with all methods showing either very poor results or irregular IoU distributions in this category. Ultimately, our method demonstrates a 0.5% improvement over the UNetFormer network. Quantitative results on the UAVid dataset demonstrate that our method surpasses the baseline UNetFormer in most categories, particularly in the small-object classes of moving car and static car, where we significantly outperform the baseline. However, our approach underperforms in the Low Vegetation category, a shortcoming we explore in the Further Analysis section. Nonetheless, our method achieves the best results in the mF1 score and overall accuracy (OA), improving by 2.99% and 1.34% over UNetFormer, respectively. The results on the Vaihingen dataset indicate that our approach produces outcomes comparable to RS3Mamba, which also utilizes the Mamba architecture, but with superior OA scores. We achieved a notable 4% improvement in the impervious surface category compared to UNetFormer. Consequently, the Tree and Car categories, which are spatially enclosed by impervious surfaces, saw a decrease in performance. Although the improvement in the class-balanced mIoU metric was modest, the classification gain in impervious surface pixels led to a 3.1% increase in OA compared to UNetFormer, exhibiting classification characteristics similar to RS3Mamba.
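For reference, the sketch below shows how the reported per-class IoU, mean IoU, mean F1 and overall accuracy (OA) are conventionally derived from a confusion matrix; it reflects common practice rather than the exact evaluation script used for Table 1, Table 2 and Table 3.

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j]: number of pixels of ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)           # per-class intersection over union
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)    # per-class F1 score
    oa = tp.sum() / conf.sum()                       # overall accuracy
    return {"IoU": iou, "mIoU": iou.mean(), "mF1": f1.mean(), "OA": oa}

# example with a toy 3-class confusion matrix
print(segmentation_metrics(np.array([[50, 2, 3], [4, 40, 1], [2, 5, 60]])))
```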
Qualitative Results: The visualization results are shown in Figure 4, Figure 5 and Figure 6. Our method outperforms the CNN-based network BANet and the purely transformer-based architecture DC-Swin on the LoveDA dataset. Our approach effectively minimizes interference from pixels of different categories in dynamically changing geographical scenes, especially where different category scenes intersect or contain each other, such as the buildings in red. Our method distinctly identifies background and some road information within clusters of buildings. In large-scene distributions such as vegetation, the network model based on the Mamba architecture exhibits superior performance to the UNetFormer network, primarily by maintaining connectivity between the same categories of ground and objects, thus preventing severe visual erosion caused by various types of disturbances. Furthermore, compared to other segmentation results, our method presents smoother mask boundaries and contains fewer noise artifacts, indicating that the contextual information extracted by Mamba is effectively utilized during the encoder phase. The visualization results on the UAVid dataset demonstrate that our method maintains correct semantic relationships in categories with regular shapes. For instance, the logical connections and appropriate depth positioning between vehicles, between vehicles and roads and between trees and low vegetation are accurately preserved, thereby avoiding the boundary confusion seen in UNetFormer, which arises from its reliance on large-scale global information. In contrast, the Vaihingen dataset features more defined contours and shapes across categories compared to UAVid and LoveDA. This results in a more coherent relationship between impervious surfaces and other categories, particularly with an improved encapsulation of small objects, leading to a higher degree of individual object separation. This characteristic is especially evident in the Low Vegetation category.

4.4. Ablation Study

We compared the qualitative and quantitative outcomes of two Mamba architectures under various configurations. Table 4 demonstrates that LASC-Mamba and Mix-Mamba significantly enhance semantic precision across most categories, particularly in cross-resolution geographical classes such as buildings and roads. Mix-Mamba effectively synthesizes global information from remote sensing imagery, ensuring the integrity of the final mask output. This is also evident in the visual representations in Figure 7, where the segmentation of buildings (represented in red) shows a substantial reduction in misclassification within the structures, resulting in more cohesive boundaries. On the other hand, the LASC-Mamba architecture refines the representation for irregular categories and fine-grained local features, incorporating multi-level basic texture information through feature link aggregation for categories like Low Vegetation and Clutter, which lack distinct boundaries, thereby enhancing the accuracy and robustness of edge segmentation. Figure 8 shows that the hybrid architecture achieves a more balanced performance across categories compared to the individual modules, particularly in the Building, Barren and Forest classes. This indicates that the final fusion effectively adapts and allocates global and local information in these categories. Although the mIoU improvement after fusion is modest, it still achieves state-of-the-art results. The visual representation in Figure 8 illustrates how the final fused model integrates the strengths of LASC-Mamba and Mix-Mamba, particularly in restoring cross-category geographical information and complementing localized details.
Notably, the results in Table 5 indicate that the integrated final structure incorporating both modules achieved optimal results across the majority of categories and in the overall metric, mean intersection over union (mIoU), demonstrating the efficacy of our approach in exploiting the advantages of the two distinct Mamba modules.

4.5. Further Analysis

Limitations and Failure Cases: We present the failure cases in Figure 9. Some visual results on the Vaihingen dataset indicate that, in complex and irregular large-scale scenes, our method may produce semantic errors compared to baseline methods. These errors are primarily concentrated in the Forest, Barren and Agriculture categories, leading to the loss of information for these high-coverage features. This may be due to the additional parallel Mamba framework in the Mix-Mamba module indirectly reducing the adaptive weight allocation of the multi-head self-attention mechanism, which generates global information, thereby leading to class confusion in large receptive fields. On the other hand, while the LASC-Mamba module effectively smooths cross-scale information, particularly enhancing fine-grained representation and accurate edge segmentation in the final output layer, the allocation of high-dimensional semantic information is limited due to feature dimensions within the linkage, which could also contribute to these failures.
Model Complexity: Table 6 illustrates the comparison of model complexity using two metrics: floating-point operations (FLOPs) and model parameters. Due to the addition of extra modules, our approach does not hold an advantage in terms of parameter count. Compared to the baseline method, our training parameters increased by 33.7%, incurring some computational cost to achieve performance improvements.
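Counts of this kind can be obtained with a profiler such as thop, as sketched below; the choice of thop and the 3 × 512 × 512 input resolution are assumptions for illustration.

```python
import torch
from thop import profile

def complexity(model, input_size=(1, 3, 512, 512)):
    """Return (GFLOPs, parameters in millions) for a given model and input size."""
    dummy = torch.randn(*input_size)
    flops, params = profile(model, inputs=(dummy,), verbose=False)
    return flops / 1e9, params / 1e6

# usage: gflops, mparams = complexity(model)
```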

5. Discussion

The semantic segmentation of UAV and satellite imagery using deep learning is a crucial technique for feature detection. In recent years, researchers have consistently improved the performance and metrics of remote sensing image classification, detection and segmentation tasks by flexibly combining convolutional neural networks with transformer modules. Most hybrid architectures effectively balance the detailed local features extracted by convolution with the broad global information captured by transformers. Some approaches also reduce input dimensions through downsampling to save computational costs for transformers.
In the field of remote sensing image segmentation, these methods are typically implemented in encoder–decoder structures, particularly in U-Net-like networks, where researchers often directly incorporate hierarchical feature outputs from hybrid or pretrained encoders into corresponding decoder stages. However, this approach overlooks the spatial representation capabilities of cross-scale outputs. This paper aims to enhance the spatiotemporal representation during fusion and decoding stages using the Mamba module. The LASC-Mamba aggregates mixed scales across multiple links, reconstructs cross-scale spatiotemporal relationships and feeds the fused output back into each decoder stage. During decoding, the Mix-Mamba module processes local convolutional features, global self-attention features and discrete SSM representations in parallel to achieve more comprehensive feature reconstruction and semantic information extraction. This multimodal feature fusion method not only enhances the model’s sensitivity to complex feature characteristics but also significantly improves the accuracy and robustness of remote sensing image segmentation. We compared the reconstructed hybrid model with baseline methods and other hybrid architectures in this field, outperforming them across various metrics on multiple standard datasets, demonstrating the model’s robustness and generalization capabilities as an effective universal architecture.
Although our work has improved the metrics and accuracy of feature classification through the construction of LASC-Mamba and Mix-Mamba, there are still some limitations. While the additional Mamba modules have enhanced the baseline method’s performance in terms of IoU and accuracy for various classes, they have also increased the model’s complexity, slightly reducing training and inference speed. Additionally, although our proposed model has shown robustness and versatility in most scenarios, it still underperforms in recognizing certain feature categories, such as the “Building” and “Barren” classes in the LoveDA dataset. This limitation may stem from the model’s insensitivity to certain dataset categories and the feature extraction bottlenecks inherent in the encoder backbone framework. Therefore, further research into underexplored feature semantics will be a focus of our future work.

6. Conclusions

This paper proposes a novel U-Net-like network for UAV image segmentation, addressing the limitations of traditional CNN–transformer networks in capturing fine-grained feature details. By constructing a hybrid model based on the Mamba architecture, we enhance the UNetFormer base model with a multi-scale fusion link aggregation Mamba. This module maps shallow semantic information into a unified state space, selectively sharing long-sequence linear information to address scale allocation issues in cross-layer connections, and reconstructs all outputs into a unified scale space. On the other hand, the Mix-Mamba module combines the long-range semantic perception of global–local self-attention with linear scaling in sequence length, constructing a linear modeling mechanism across different channels and spatial dimensions through multi-path interactions. Additionally, the Mix-Mamba module can serve as a plug-and-play component for other general architectures. Experimental evaluations on public remote sensing image segmentation datasets (LoveDA, UAVid and Vaihingen) demonstrate that the proposed method outperforms other CNN- and transformer-based semantic segmentation algorithms, highlighting its significant potential in the field of remote sensing image segmentation.

Author Contributions

Conceptualization, Q.Z. and G.G.; methodology, Q.Z., G.G. and Q.L.; resources, Q.Z. and P.Z.; writing—original draft preparation, Y.W. and Q.L.; writing—review and editing, Q.Z., G.G., P.Z., Y.W., Q.L. and K.L.; visualization, Q.Z. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62271393), Xi’an Science and Technology Plan Project (24SFSF0002), National Social Science Fund of China Major Projects in Art Studies (24ZD10), Key Research and Development Program of Shaanxi Province (2019GY215, 2021ZDLSF06-04), National Natural Science Foundation of China Youth Fund (62302393).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is sourced from open access: https://github.com/Junjue-Wang/LoveDA and https://uavid.nl/.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xing, J.; Sieber, R.; Caelli, T. A scale-invariant change detection method for land use/cover change research. ISPRS J. Photogramm. Remote Sens. 2018, 141, 252–264. [Google Scholar] [CrossRef]
  2. Yin, H.; Pflugmacher, D.; Li, A.; Li, Z.; Hostert, P. Land use and land cover change in Inner Mongolia-understanding the effects of China’s re-vegetation programs. Remote Sens. Environ. 2018, 204, 918–930. [Google Scholar] [CrossRef]
  3. Shao, P.; Yi, Y.; Liu, Z.; Dong, T.; Ren, D. Novel multiscale decision fusion approach to unsupervised change detection for high-resolution images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 2503105. [Google Scholar] [CrossRef]
  4. Samie, A.; Abbas, A.; Azeem, M.M.; Hamid, S.; Iqbal, M.A.; Hasan, S.S.; Deng, X. Examining the impacts of future land use/land cover changes on climate in Punjab province, Pakistan: Implications for environmental sustainability and economic growth. Environ. Sci. Pollut. Res. 2020, 27, 25415–25433. [Google Scholar] [CrossRef] [PubMed]
  5. Lobo Torres, D.; Queiroz Feitosa, R.; Nigri Happ, P.; Elena Cué La Rosa, L.; Marcato, J., Jr.; Martins, J.; Ola Bressan, P.; Gonçalves, W.N.; Liesenberg, V. Applying fully convolutional architectures for semantic segmentation of a single tree species in urban environment on high resolution UAV optical imagery. Sensors 2020, 20, 563. [Google Scholar] [CrossRef]
  6. Hoeser, T.; Bachofer, F.; Kuenzer, C. Object detection and image segmentation with deep learning on Earth observation data: A review—Part II: Applications. Remote Sens. 2020, 12, 3053. [Google Scholar] [CrossRef]
  7. Chai, B.; Nie, X.; Zhou, Q.; Zhou, X. Enhanced Cascade R-CNN for Multi-scale Object Detection in Dense Scenes from SAR Images. IEEE Sens. J. 2024, 24, 20143–20153. [Google Scholar] [CrossRef]
  8. Zhang, C.; Wang, L.; Yang, R. Semantic segmentation of urban scenes using dense depth maps. In Proceedings of the Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; Proceedings, Part IV 11. Springer: Berlin/Heidelberg, Germany, 2010; pp. 708–721. [Google Scholar]
  9. Schmitt, M.; Prexl, J.; Ebel, P.; Liebel, L.; Zhu, X.X. Weakly supervised semantic segmentation of satellite images for land cover mapping—Challenges and opportunities. arXiv 2020, arXiv:2002.08254. [Google Scholar] [CrossRef]
  10. Kherraki, A.; Maqbool, M.; El Ouazzani, R. Traffic scene semantic segmentation by using several deep convolutional neural networks. In Proceedings of the 2021 3rd IEEE Middle East and North Africa COMMunications Conference (MENACOMM), Virtual, 3–5 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [Google Scholar]
  11. Boudissa, M.; Kawanaka, H.; Wakabayashi, T. Semantic segmentation of traffic landmarks using classical computer vision and U-Net model. Proc. J. Phys. Conf. Ser. 2022, 2319, 012031. [Google Scholar] [CrossRef]
  12. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  13. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  14. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  15. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  16. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  17. Wang, J.; HQ Ding, C.; Chen, S.; He, C.; Luo, B. Semi-supervised remote sensing image semantic segmentation via consistency regularization and average update of pseudo-label. Remote Sens. 2020, 12, 3603. [Google Scholar] [CrossRef]
  18. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  19. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 205–218. [Google Scholar]
  20. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. 2021. Available online: http://arxiv.org/abs/2102.04306 (accessed on 11 June 2024).
  21. Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A novel transformer based semantic segmentation scheme for fine-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6506105. [Google Scholar] [CrossRef]
  22. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  23. Panboonyuen, T.; Jitkajornwanich, K.; Lawawirojwong, S.; Srestasathiern, P.; Vateekul, P. Transformer-based decoder designs for semantic segmentation on remotely sensed images. Remote Sens. 2021, 13, 5100. [Google Scholar] [CrossRef]
  24. Xu, Z.; Zhang, W.; Zhang, T.; Yang, Z.; Li, J. Efficient transformer for remote sensing image segmentation. Remote Sens. 2021, 13, 3585. [Google Scholar] [CrossRef]
  25. Adams, R.; Bischof, L. Seeded region growing. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 641–647. [Google Scholar] [CrossRef]
  26. Mary Synthuja Jain Preetha, M.; Padma Suresh, L.; John Bosco, M. Image segmentation using seeded region growing. In Proceedings of the 2012 International Conference on Computing, Electronics and Electrical Technologies (ICCEET), Nagercoil, India, 21–22 March 2012; pp. 576–583. [Google Scholar] [CrossRef]
  27. Athanasiadis, T.; Mylonas, P.; Avrithis, Y.; Kollias, S. Semantic image segmentation and object labeling. IEEE Trans. Circuits Syst. Video Technol. 2007, 17, 298–312. [Google Scholar] [CrossRef]
  28. Liu, Z.; Li, X.; Luo, P.; Loy, C.C.; Tang, X. Deep learning markov random field for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1814–1828. [Google Scholar] [CrossRef]
  29. Vemulapalli, R.; Tuzel, O.; Liu, M.Y.; Chellapa, R. Gaussian conditional random field network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3224–3233. [Google Scholar]
  30. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7262–7272. [Google Scholar]
  31. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  32. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  33. Zhang, Q.; Geng, G.; Yan, L.; Zhou, P.; Li, Z.; Li, K.; Liu, Q. P-MSDiff: Parallel Multi-Scale Diffusion for Remote Sensing Image Segmentation. arXiv 2024, arXiv:2405.20443. [Google Scholar]
  34. Chen, K.; Zou, Z.; Shi, Z. Building extraction from remote sensing images with sparse token transformers. Remote Sens. 2021, 13, 4441. [Google Scholar] [CrossRef]
  35. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  36. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. 2023. Available online: http://arxiv.org/abs/2312.00752 (accessed on 11 June 2024).
  37. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. 2024. Available online: http://arxiv.org/abs/2401.09417 (accessed on 11 June 2024).
  38. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. 2024. Available online: http://arxiv.org/abs/2401.10166 (accessed on 11 June 2024).
  39. Lieber, O.; Lenz, B.; Bata, H.; Cohen, G.; Osin, J.; Dalmedigos, I.; Safahi, E.; Meirom, S.; Belinkov, Y.; Shalev-Shwartz, S.; et al. Jamba: A hybrid transformer-mamba language model. arXiv 2024, arXiv:2403.19887. [Google Scholar]
  40. Xu, J. HC-Mamba: Vision MAMBA with Hybrid Convolutional Techniques for Medical Image Segmentation. arXiv 2024, arXiv:2405.05007. [Google Scholar]
  41. Chen, S.; Atapour-Abarghouei, A.; Zhang, H.; Shum, H.P. MxT: Mamba × Transformer for Image Inpainting. arXiv 2024, arXiv:2407.16126. [Google Scholar]
  42. Wang, Y.; Liu, Y.; Deng, D.; Wang, Y. Reunet: An Efficient Remote Sensing Image Segmentation Network. In Proceedings of the 2023 International Conference on Machine Learning and Cybernetics (ICMLC), Adelaide, Australia, 9–11 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 63–68. [Google Scholar]
  43. Cao, Y.; Liu, S.; Peng, Y.; Li, J. DenseUNet: Densely connected UNet for electron microscopy image segmentation. IET Image Process. 2020, 14, 2682–2689. [Google Scholar] [CrossRef]
  44. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Proceedings 4. Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–11. [Google Scholar]
  45. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.W.; Wu, J. Unet 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1055–1059. [Google Scholar]
  46. Zhang, C.; Wang, R.; Chen, J.W.; Li, W.; Huo, C.; Niu, Y. A Multi-Branch U-Net for Water Area Segmentation with Multi-Modality Remote Sensing Images. In Proceedings of the IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; pp. 5443–5446. [Google Scholar] [CrossRef]
  47. Yue, K.; Yang, L.; Li, R.; Hu, W.; Zhang, F.; Li, W. TreeUNet: Adaptive tree convolutional neural networks for subdecimeter aerial image segmentation. ISPRS J. Photogramm. Remote Sens. 2019, 156, 1–13. [Google Scholar] [CrossRef]
  48. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  49. Beal, J.; Kim, E.; Tzeng, E.; Park, D.H.; Zhai, A.; Kislyuk, D. Toward transformer-based object detection. arXiv 2020, arXiv:2012.09958. [Google Scholar]
  50. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  51. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  52. Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar]
  53. Gu, A.; Dao, T.; Ermon, S.; Rudra, A.; Ré, C. Hippo: Recurrent memory with optimal polynomial projections. Adv. Neural Inf. Process. Syst. 2020, 33, 1474–1487. [Google Scholar]
  54. Xiao, Y.; Yuan, Q.; Jiang, K.; Chen, Y.; Zhang, Q.; Lin, C.W. Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution. arXiv 2024, arXiv:2405.04964. [Google Scholar]
  55. Zhang, H.; Chen, K.; Liu, C.; Chen, H.; Zou, Z.; Shi, Z. CDMamba: Remote Sensing Image Change Detection with Mamba. arXiv 2024, arXiv:2406.04207. [Google Scholar]
  56. Zhu, Q.; Cai, Y.; Fang, Y.; Yang, Y.; Chen, C.; Fan, L.; Nguyen, A. Samba: Semantic segmentation of remotely sensed images with state space model. arXiv 2024, arXiv:2404.01705. [Google Scholar] [CrossRef]
  57. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  58. Lyu, Y.; Vosselman, G.; Xia, G.S.; Yilmaz, A.; Yang, M.Y. UAVid: A semantic segmentation dataset for UAV imagery. ISPRS J. Photogramm. Remote Sens. 2020, 165, 108–119. [Google Scholar] [CrossRef]
  59. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv 2021, arXiv:2110.08733. [Google Scholar]
  60. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  61. Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer meets convolution: A bilateral awareness network for semantic segmentation of very fine resolution urban scene images. Remote Sens. 2021, 13, 3065. [Google Scholar] [CrossRef]
  62. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive bilateral contextual network for efficient semantic segmentation of Fine-Resolution remotely sensed imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98. [Google Scholar] [CrossRef]
  63. He, P.; Jiao, L.; Shang, R.; Wang, S.; Liu, X.; Quan, D.; Yang, K.; Zhao, D. MANet: Multi-Scale Aware-Relation Network for Semantic Segmentation in Aerial Scenes. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5624615. [Google Scholar] [CrossRef]
  64. Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and Multiscale Transformer Fusion Network for Remote-Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2004612. [Google Scholar] [CrossRef]
  65. Liu, M.; Dan, J.; Lu, Z.; Yu, Y.; Li, Y.; Li, X. CM-UNet: Hybrid CNN-Mamba UNet for Remote Sensing Image Semantic Segmentation. arXiv 2024, arXiv:2405.10530. [Google Scholar]
  66. Ma, X.; Zhang, X.; Pun, M.O. RS3Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6011405. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of the proposed network. The network consists of three main components: a CNN encoder, an LASC-Mamba skip connection structure and a Mix-Mamba decoder.
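For readers who want to map the caption onto code, the following minimal PyTorch sketch wires a four-stage CNN encoder, a pluggable skip-connection module and a lightweight decoder in the U-Net-like pattern of Figure 1. All names, channel widths and the identity placeholder used for the skip module are illustrative assumptions, and the decoder stages here are plain transposed convolutions rather than the paper's Mix-Mamba blocks.

```python
import torch
import torch.nn as nn

class PlaceholderEncoder(nn.Module):
    """Illustrative 4-stage CNN encoder (a stand-in for the paper's backbone)."""
    def __init__(self, chs=(64, 128, 256, 512)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for ch in chs:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, ch, 3, stride=2, padding=1),
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True)))
            in_ch = ch

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)            # one feature map per scale
        return feats

class UNetLikeSegmenter(nn.Module):
    """U-Net-like wiring: encoder -> skip-connection module -> decoder -> head."""
    def __init__(self, num_classes, chs=(64, 128, 256, 512), skip_module=None):
        super().__init__()
        self.encoder = PlaceholderEncoder(chs)
        # skip_module is where an LASC-Mamba-style block would sit;
        # nn.Identity keeps the sketch runnable on its own.
        self.skip = skip_module or nn.Identity()
        self.up = nn.ModuleList([
            nn.ConvTranspose2d(chs[i], chs[i - 1], 2, stride=2)
            for i in range(len(chs) - 1, 0, -1)])
        self.head = nn.Conv2d(chs[0], num_classes, 1)

    def forward(self, x):
        feats = self.skip(self.encoder(x))   # list of 4 multi-scale features
        y = feats[-1]
        for i, up in enumerate(self.up):
            y = up(y) + feats[-(i + 2)]      # decoder stage fused with skip feature
        return self.head(y)                  # logits at 1/2 of the input resolution
```

For example, `UNetLikeSegmenter(num_classes=8)(torch.randn(1, 3, 512, 512))` returns an 8-channel logit map at half the input resolution.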
Figure 2. The structure of LASC-Mamba. LASC-Mamba transforms two-dimensional features across four different scales and aggregates the one-dimensional outputs through link aggregation, ultimately restoring feature dimensions and utilizing skip connections at the terminal stage.
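The caption describes a flatten, per-branch sequence modelling, link aggregation and restore pipeline. The sketch below illustrates only that data flow: simple MLP mixers stand in for the four Mamba/SSM branches, and the shared projection width `dim` is an arbitrary assumption, not the authors' configuration.

```python
import torch
import torch.nn as nn

class LinkAggregationSkip(nn.Module):
    """Sketch: flatten multi-scale 2-D features into token sequences, mix each
    sequence in its own branch (stand-ins for the Mamba/SSM branches), aggregate
    all tokens, then decouple and restore the original feature shapes."""
    def __init__(self, in_chs=(64, 128, 256, 512), dim=128):
        super().__init__()
        self.proj_in = nn.ModuleList([nn.Linear(c, dim) for c in in_chs])
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
             for _ in in_chs])                      # placeholders, not real SSMs
        self.fuse = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))
        self.proj_out = nn.ModuleList([nn.Linear(dim, c) for c in in_chs])

    def forward(self, feats):                       # list of (B, C_i, H_i, W_i)
        tokens, shapes = [], []
        for f, proj in zip(feats, self.proj_in):
            _, c, h, w = f.shape
            shapes.append((c, h, w))
            tokens.append(proj(f.flatten(2).transpose(1, 2)))  # (B, H_i*W_i, dim)
        mixed = [branch(t) for branch, t in zip(self.branches, tokens)]
        fused = self.fuse(torch.cat(mixed, dim=1))  # link aggregation over tokens
        outs, start = [], 0
        for (c, h, w), proj, f in zip(shapes, self.proj_out, feats):
            n = h * w
            part = fused[:, start:start + n]        # decouple this scale's tokens
            start += n
            outs.append(proj(part).transpose(1, 2).reshape(-1, c, h, w) + f)
        return outs
```

In the wiring sketch after Figure 1, an instance of this module could be passed as `skip_module` so that the aggregated features feed the decoder-side skip connections.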
Figure 3. The structure of Mix-Mamba, which comprises a local feature extraction module on the left, a global self-attention mechanism in the middle and a Mamba module on the right.
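As a rough illustration of the three parallel paths in Figure 3, the sketch below fuses a depthwise-convolution branch, a multi-head self-attention branch and a recurrent sequence branch; the GRU is used purely as a stand-in for the Mamba/SSM path, and the additive fusion is an assumption, not the authors' Mix-Mamba implementation.

```python
import torch
import torch.nn as nn

class MixBlock(nn.Module):
    """Three-path block in the spirit of Figure 3: a convolutional local branch,
    a global self-attention branch and a sequence-mixer branch, fused residually."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.local = nn.Sequential(                       # local feature extraction
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, 1), nn.GELU())
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.seq = nn.GRU(dim, dim, batch_first=True)     # placeholder for Mamba/SSM
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                                 # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))  # (B, H*W, C)
        local = self.local(x)                             # convolutional path
        attn, _ = self.attn(tokens, tokens, tokens)       # global attention path
        seq, _ = self.seq(tokens)                         # sequential path
        mixed = (attn + seq).transpose(1, 2).reshape(b, c, h, w)
        return x + local + mixed                          # residual fusion
```

For example, `MixBlock(dim=128)(torch.randn(1, 128, 32, 32))` returns a feature map of the same shape, so such a block can be stacked inside a decoder stage.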
Figure 4. Visualization of results on the LoveDA validation set.
Figure 5. Visualization of results on the UAVid validation set.
Figure 6. Visualization of results on the Vaihingen test set.
Figure 7. Visualization of results on the UAVid validation set.
Figure 8. Visualization of results on the LoveDA validation set.
Figure 9. Failure cases on the Vaihingen dataset.
Table 1. Quantitative comparison results on the LoveDA test set with recent research methodologies. The best values in each column are highlighted in bold.

| Method | Background | Building | Road | Water | Barren | Forest | Agriculture | mIoU |
|---|---|---|---|---|---|---|---|---|
| ABCNet | 41.8 | 56.6 | 50.7 | 77.1 | 14.9 | 45.2 | 54.2 | 48.6 |
| MANet | 41.8 | 55.1 | 53.4 | 75.5 | 14.6 | 44.4 | 55.1 | 48.5 |
| BANet | 43.7 | 51.5 | 51.1 | 76.9 | 16.6 | 44.9 | 62.5 | 49.6 |
| TransUNet | 43.0 | 56.1 | 53.7 | 78.0 | 9.3 | 44.9 | 56.9 | 48.9 |
| Segmenter | 38.0 | 50.7 | 48.7 | 77.4 | 13.3 | 43.5 | 58.2 | 47.1 |
| A2FPN | 42.7 | 57.6 | 54.3 | 78.0 | 14.1 | 45.0 | 54.9 | 49.5 |
| DC-Swin | 43.3 | 54.3 | 54.3 | 78.7 | 14.9 | 45.3 | 59.6 | 50.0 |
| UNetFormer | 44.7 | 58.8 | 54.9 | 79.6 | 20.1 | 46.0 | 62.5 | 52.4 |
| CM-UNet | 54.6 | 64.1 | 55.5 | 68.1 | 29.6 | 42.9 | 50.4 | 52.2 |
| RS3Mamba | 39.7 | 58.8 | 57.9 | 61.0 | 37.2 | 39.7 | 34.0 | 50.9 |
| Ours | 46.1 | 58.4 | 55.6 | 79.9 | 19.2 | 47.6 | 63.2 | 52.9 |
Table 2. Quantitative comparison results on the UAVid dataset with recent research methodologies. The best values in each column are highlighted in bold.

| Method | Building | Road | Tree | Low Veg. | Mov. Car | Static Car | Human | Clutter | mIoU | mF1 | OA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BANet | 90.04 | 75.66 | 78.11 | 68.42 | 70.17 | 64.12 | 42.68 | 62.27 | 68.93 | 80.88 | 86.95 |
| DC-Swin | 92.67 | 78.95 | 77.73 | 67.08 | 69.27 | 64.59 | 37.27 | 65.91 | 69.2 | 80.80 | 87.90 |
| A2FPN | 90.83 | 77.45 | 77.97 | 68.66 | 67.07 | 64.9 | 46.91 | 63.21 | 69.62 | 81.48 | 87.3 |
| LSKNet-T | 91.80 | 73.83 | 79.09 | 69.47 | 75.85 | 69.43 | 46.85 | 60.72 | 70.89 | 82.32 | 87.34 |
| MANet | 91.77 | 78.10 | 78.59 | 69.14 | 72.20 | 69.48 | 48.50 | 65.28 | 71.63 | 82.92 | 87.99 |
| UNetFormer | 90.64 | 76.45 | 77.52 | 67.76 | 71.22 | 67.38 | 46.62 | 62.70 | 70.00 | 81.64 | 87.21 |
| Ours | 92.45 | 82.94 | 82.35 | 58.56 | 79.84 | 70.47 | 50.64 | 63.74 | 72.63 | 84.63 | 88.55 |
Table 3. Quantitative comparison results on the Vaihingen dataset with recent research methodologies. The best values in each column are highlighted in bold.

| Method | Imp. Surf. | Building | Low Veg. | Tree | Car | mF1 | mIoU | OA |
|---|---|---|---|---|---|---|---|---|
| DANet | 90.0 | 93.9 | 82.2 | 87.3 | 44.5 | 79.6 | 69.4 | 88.2 |
| ABCNet | 92.7 | 95.2 | 84.5 | 89.7 | 85.3 | 89.5 | 81.3 | 90.7 |
| BANet | 92.2 | 95.2 | 83.8 | 89.9 | 86.8 | 89.6 | 81.4 | 90.5 |
| Segmenter | 89.8 | 93.0 | 81.2 | 88.9 | 67.6 | 84.1 | 73.6 | 88.1 |
| ESDINet | 92.7 | 95.5 | 84.5 | 90.0 | 87.2 | 90.0 | 82.0 | 90.9 |
| UNetFormer | 92.7 | 95.3 | 84.9 | 90.6 | 88.5 | 90.4 | 82.7 | 90.4 |
| CMTFNet | 90.6 | 94.2 | 81.9 | 87.6 | 82.8 | 87.4 | 78.0 | 88.7 |
| RS3Mamba | 96.7 | 95.5 | 84.4 | 90.0 | 86.9 | 90.7 | 83.3 | 93.2 |
| Ours | 96.7 | 96.0 | 86.1 | 89.5 | 84.7 | 90.6 | 83.2 | 93.5 |
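The comparison tables above report per-class scores together with mIoU, mF1 and OA. These summary metrics follow directly from a pixel-level confusion matrix; the short NumPy sketch below shows the relationship and is illustrative only, not the evaluation code used for the tables.

```python
import numpy as np

def segmentation_scores(conf):
    """conf[i, j] = number of pixels with ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp            # predicted as class k but actually another class
    fn = conf.sum(axis=1) - tp            # class-k pixels assigned to another class
    iou = tp / (tp + fp + fn + 1e-12)     # per-class intersection over union
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-12)
    oa = tp.sum() / conf.sum()            # overall (pixel) accuracy
    return iou.mean(), f1.mean(), oa      # mIoU, mF1, OA

# Toy 3-class example: rows are ground truth, columns are predictions.
conf = np.array([[50, 2, 3],
                 [4, 40, 1],
                 [2, 5, 30]])
miou, mf1, oa = segmentation_scores(conf)
print(f"mIoU={miou:.3f}, mF1={mf1:.3f}, OA={oa:.3f}")
```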
Table 4. Quantitative results of ablation experiments conducted on the UAVid dataset. The best values in each column are displayed in bold.

| UNetFormer | LASC-Mamba | Mix-Mamba | Building | Road | Tree | Low Veg. | Mov. Car | Static Car | Human | Clutter | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | - | - | 90.64 | 76.45 | 77.52 | 67.76 | 71.22 | 67.38 | 46.62 | 62.70 | 70.0 |
| ✓ | ✓ | - | 91.18 | 78.18 | 78.13 | 68.92 | 72.50 | 68.96 | 47.57 | 64.18 | 71.20 |
| ✓ | - | ✓ | 91.35 | 78.79 | 78.37 | 68.24 | 73.01 | 70.29 | 47.99 | 65.20 | 71.66 |
| ✓ | ✓ | ✓ | 92.45 | 82.94 | 82.35 | 58.56 | 79.84 | 70.47 | 50.64 | 63.74 | 72.63 |
Table 5. Quantitative results of ablation experiments conducted on the LoveDA dataset. The best values in each column are displayed in bold.

| UNetFormer | LASC-Mamba | Mix-Mamba | Background | Building | Road | Water | Barren | Forest | Agriculture | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | - | - | 44.7 | 58.8 | 54.9 | 79.6 | 20.1 | 46.0 | 62.5 | 52.4 |
| ✓ | ✓ | - | 46.7 | 57.8 | 57.3 | 80.9 | 15.8 | 47.5 | 63.7 | 52.8 |
| ✓ | - | ✓ | 46.1 | 58.3 | 58.5 | 79.6 | 18.5 | 46.9 | 61.5 | 52.8 |
| ✓ | ✓ | ✓ | 46.1 | 58.4 | 55.5 | 79.9 | 19.2 | 47.6 | 63.2 | 52.9 |
Table 6. Computational complexity measurements on a single NVIDIA RTX 3090 GPU.

| Method | FLOPs (G) | Params (M) |
|---|---|---|
| ABCNet | 7.81 | 13.39 |
| CMTFNet | 17.14 | 30.07 |
| UNetFormer | 5.87 | 11.69 |
| RS3Mamba | 31.65 | 43.32 |
| CM-UNet | 6.01 | 12.89 |
| Ours | 7.89 | 15.63 |
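Parameter counts like those in Table 6 can be reproduced directly from a model object with a one-line sum, while FLOPs require a profiler. The sketch below counts trainable parameters in plain PyTorch and, as an assumption about tooling (the paper does not state how FLOPs were measured), uses the widely available thop profiler on a stand-in model and input size.

```python
import torch
import torch.nn as nn

def count_parameters_m(model: nn.Module) -> float:
    """Trainable parameter count, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

model = nn.Conv2d(3, 64, 3)               # stand-in for the segmentation network
x = torch.randn(1, 3, 512, 512)           # example input tensor
print(f"Params: {count_parameters_m(model):.4f} M")

# FLOPs via the thop package (assumed installed: pip install thop).
try:
    from thop import profile
    flops, params = profile(model, inputs=(x,), verbose=False)
    print(f"FLOPs: {flops / 1e9:.2f} G, Params: {params / 1e6:.4f} M")
except ImportError:
    pass  # profiler not installed; the parameter count above still holds
```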