1. Introduction
The image data acquired through high-resolution remote sensing imaging technology contains detailed spatial information that is vital for numerous applications, such as urban planning [1], disaster assessment [2], intelligent traffic systems [3], and land use monitoring [4]. The precise segmentation of these images is therefore a task of significant practical importance. Historically, segmentation has relied on conventional techniques, including region growing [5], thresholding [6], and methods based on handcrafted features [7]. These approaches, however, often demand considerable manual effort for parameter tuning and feature design. More critically, their underlying feature extraction mechanisms frequently exacerbate intra-class heterogeneity while weakening inter-class separability within the image data, ultimately leading to suboptimal segmentation accuracy.
The methodological evolution in remote sensing image analysis has been increasingly dominated by deep learning techniques over recent years, most notably through the widespread adoption and success of convolutional neural networks (CNNs). The pioneering work of AlexNet [8] marked a significant shift towards deep learning in image-related tasks, subsequently inspiring a substantial body of research applying advanced networks to various remote sensing challenges, including object detection [9,10,11,12,13], semantic segmentation [14,15,16,17,18], and environmental monitoring [19]. While traditional CNNs were limited in handling spatial context, fully convolutional networks (FCNs) [20] overcame this by enabling pixel-wise prediction and preserving spatial information. However, a key limitation of FCNs is their tendency to lose detailed information during upsampling, as they do not explicitly model pixel-level relationships, often resulting in imprecise object boundaries. To address these issues, several enhanced architectures derived from the FCN framework have been proposed, such as UNet [21], DeepLabV3+ [22], and SegNet [23]. These models are designed to mitigate the problem of weakly correlated features across different network depths and positions, as well as the difficulty of directly fusing multi-level feature maps. Among them, UNet has established itself as a fundamental architecture for image segmentation. Its design leverages skip connections to integrate high-resolution features from the encoder path with the upsampled output from the decoder path, effectively preserving fine-grained spatial details alongside high-level semantic information during the reconstruction phase. Drawing inspiration from UNet’s symmetrical design, SegNet [23] also employs an encoder–decoder structure: pooling operations in its encoder progressively reduce feature map resolution, while the corresponding decoder uses memorized pooling indices from the encoder to perform deconvolution operations, thereby recovering spatial details and the original dimensions. The superior segmentation capability of UNet has motivated considerable further research, yielding numerous high-performing network variants. For instance, MF-UNet [24] introduces an innovative multi-scale feature extraction block that aggregates shallow features at various scales; by substituting the original skip connections, it alleviates the semantic gap, particularly evident in building segmentation tasks. In another advancement, Dense-UNet [25] incorporates a median-frequency-balanced focal loss function to enhance segmentation accuracy for underrepresented or small object categories. Nonetheless, these advanced methods still exhibit certain shortcomings. A primary limitation lies in their insufficient capacity to model and incorporate global contextual relationships among all pixels, which can lead to the loss of important long-range information. Additionally, they often fail to fully exploit the contextual dependencies between different semantic categories. This inadequacy frequently causes misclassification in the segmentation outputs, blurring inter-class distinctions and reducing overall separability.
Segmenting large-scale targets in high-resolution remote sensing imagery presents significant challenges, primarily due to scene complexity, varying illumination conditions, and diverse imaging perspectives. These difficulties stem mainly from two inherent issues: (1) High intra-class heterogeneity: The substantial variability in the visual characteristics of objects, when observed across different temporal periods and geographical locations, poses significant challenges for consistent analysis. Urban building types, such as residential blocks and commercial complexes, possess distinct spectral characteristics influenced by architectural design and materials. Similarly, cultivated land in mountainous regions and on plains exhibits varied properties, despite being classified under the same land use category. (2) Low inter-class separability: Different object categories frequently exhibit overlapping or similar feature representations, which can induce misclassification during the segmentation process. For instance, in urban settings, roads are prone to being confused with building shadows due to their spectral and textural resemblance. Likewise, residential gardens or courtyards are often misidentified as buildings. These factors collectively make it difficult to accurately judge the category of pixels along object boundaries, resulting in blurred edges and segmentation inaccuracies. Conventional convolutional stacking methods struggle to capture precise semantic information at object boundaries, thereby failing to achieve accurate target delineation.
To address these issues, the attention mechanism [26,27,28,29,30] has emerged as a key strategy, leading to the development of a variety of advanced image segmentation models that extend the powerful UNet framework. Attention UNet [31], for instance, incorporates an attention gate to focus adaptively on target structures of varying shapes and sizes; it selectively suppresses irrelevant areas in the input image while preserving and enhancing features that are critical to the given task. Subsequent models [32,33] employ multi-scale and multi-branch attention modules, respectively, to better unify semantic features and capture contextual dependencies. To mitigate information loss in skip connections, [34] introduces a spatial-channel attention gate, while [35] enhances skip connections by combining a linear attention mechanism with residual connections. Further refinements include the modified attention gate proposed by [36], which relocates the resampler with a sigmoid function to bridge multi-level features, culminating in the attention gate UNet architecture. For high-resolution remote sensing imagery, the holistic nested attentional UNet [37] integrates attention with multi-scale nested modules to significantly improve the extraction of fine edge details.
Nevertheless, current attention-based models designed for segmentation in remote sensing imagery continue to face difficulties in retaining fine structural details and producing segmentation results with crisp boundaries. This limitation becomes particularly pronounced in complex scenes containing various interfering elements, such as tree shadows and parking areas, which often lead to insufficient extraction of accurate features and consequently cause numerous segmentation errors. To address these challenges, this paper proposes HMA-UNet, a hybrid multi-attention network specifically optimized for high-resolution remote sensing image segmentation. The main contributions of this work are summarized as follows:
(1) We construct a feature extractor named lightweight channel spatial attention (LCSA). The LCSA module applies channel-wise and spatial-wise attention to the feature maps in a coordinated yet independent manner, capturing channel-level features in the channel dimension and spatial-level information in the spatial dimension. This design combines the complementary strengths of the two attention forms, significantly enhancing the network’s ability to screen and prioritize informative features and thereby achieving a more precise and comprehensive feature representation.
(2) Building on the concept of adaptive feature refinement through attention, this paper proposes a multi-scale feature fusion module named lightweight split and fuse attention (LSFA). The LSFA module first divides the input features into two branches with varied receptive fields, then selectively recalibrates and aggregates them. This enables the network to dynamically adjust the emphasis on different spatial scales and contextual cues per channel. As a result, the module yields a more flexible and discriminative feature representation, which improves the model’s ability to recognize objects across diverse scales.
(3) We propose HMA-UNet, which incorporates a reinforced backbone network to address the semantic spatial gap between lower-level and higher-level feature representations in the standard UNet architecture. Specifically, we redesign the backbone by embedding the LSFA module in residual form at shallow stages to enhance multi-scale feature aggregation, and integrating the LCSA module at deeper stages to refine channel-wise and spatial feature responses. This structured enhancement enables the network to progressively emphasize foreground relevant information while suppressing background interference, ultimately leading to improved segmentation accuracy.
The remainder of this paper is organized as follows:
Section 2 offers a comprehensive overview of the proposed LCSA and LSFA modules, describing their architectural design and core principles. We then present HMA-UNet, a hybrid multi-attention segmentation network that integrates the LCSA and LSFA modules, specifically developed for segmentation in remote sensing imagery.
Section 3 elaborates on the experimental configuration, encompassing descriptions of the benchmark datasets (i.e., the WHDLD Dataset [38] and the DeepGlobe Dataset [39]), the implementation environment, and the evaluation indicators.
Section 4 is dedicated to the experimental findings, where we discuss the performance of HMA-UNet in various segmentation scenarios. Finally, Section 5 concludes with a summary of the principal outcomes and insights obtained from the comprehensive experiments.
2. Methodology
To achieve precise target segmentation and effective background suppression in remote sensing imagery, this paper employs the attention mechanism as a core strategy. The mechanism facilitates dynamic feature recalibration by guiding the model to emphasize regions of interest while attenuating irrelevant background content, thereby enhancing the discriminative power of target representations. To mitigate persistent issues such as segmentation ambiguity, inadequate target feature extraction, and frequent false positives in complex scenarios, we propose the LCSA and LSFA modules, which are specifically engineered to address the heterogeneous and variable characteristics inherent to remote sensing images.
2.1. LCSA Module
The LCSA module integrates two sequentially connected attention units: a channel attention unit and a spatial attention unit. The channel attention mechanism employs adaptive max pooling and adaptive average pooling, followed by a convolution operator and an activation function, to generate a channel-wise weighting vector that highlights the most informative feature channels. Subsequently, the spatial attention unit processes channel-reduced global features through a similar chain of operations to produce a spatial weighting map that enhances salient regions across all feature layers. The detailed architecture of the LCSA module is illustrated in Figure 1.
The channel attention unit initiates processing by transforming the input feature tensor of size C × H × W into a compact representation through global average pooling applied across the spatial dimensions. This operation aggregates spatial information from all channels, yielding a channel-wise descriptor of size C × 1 × 1. The descriptor is subsequently refined by a convolution operation that reduces its dimensionality, and a Sigmoid activation function is then applied to the convolution output to produce the channel-level weights: a vector of size C × 1 × 1 in which each element indicates the importance of the corresponding channel for feature representation. Feature refinement is achieved by performing element-wise multiplication between these weights and the original feature tensor, generating a refined feature map that serves as the input to the subsequent spatial attention stage.
We then apply a global max pooling operation on the input feature tensor along the channel dimension, selecting the maximum activation at each spatial location to obtain a single-channel feature map of size 1 × H × W. This feature map undergoes a convolution operation with a stride of 1, whose output is normalized by a Sigmoid function, yielding the spatial attention weight map of size 1 × H × W. This weight map encodes the relative importance of each spatial position. Finally, the output feature map is obtained by performing an element-wise multiplication between the spatial attention weight map and the input feature map, thereby emphasizing informative regions while suppressing less relevant areas.
The orange and blue pathways in Figure 1 show two different attention processing patterns in the LCSA module. While both pathways share an identical architectural framework, they differ fundamentally in their feature aggregation strategies. Within the orange pathway, the channel attention sub-module employs global max pooling to generate channel-wise statistics, capturing the most salient feature response per channel, as opposed to the global average pooling utilized in the blue pathway. Conversely, for spatial attention, the orange pathway adopts global average pooling across channels to compile a spatial weight map, contrasting with the global max pooling operation applied in the blue pathway. Ultimately, the feature maps output by these two branches are aggregated via element-wise summation to produce the final output of the LCSA module.
The LCSA module’s design incorporates dual considerations. First, it mitigates errors inherent in feature extraction, which stem primarily from increased estimation variance due to limited receptive fields and from bias in the estimated mean values arising from parameter inaccuracies in convolutional layers. Average pooling is utilized to suppress intra-group variance and retain features of interest, while max pooling amplifies inter-group variance and retains richer texture details. These two parallel branches enable the aggregation of multi-perspective features from different pooling operations, thereby capturing more comprehensive channel-wise and spatial detail information to enhance segmentation accuracy. Second, the module sequentially emphasizes both critical channels and significant spatial locations in the feature maps: it first evaluates and recalibrates channel importance through channel attention, and then refines spatial information through spatial attention. This sequential enhancement prioritizes channel information before modulating spatial relationships, which effectively improves the extraction and utilization of target information. Consequently, it preserves edge details better and strengthens the network’s overall discriminative capability.
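To make the dual-pathway design concrete, the following is a minimal PyTorch sketch of the LCSA module as described above. The channel-reduction ratio (r = 4) and the 7 × 7 spatial kernel size are assumptions for illustration, as this section does not fix those hyperparameters.

```python
import torch
import torch.nn as nn

class LCSAPathway(nn.Module):
    """One LCSA pathway: channel attention followed by spatial attention."""
    def __init__(self, channels: int, channel_pool: str, spatial_pool: str, r: int = 4):
        super().__init__()
        self.channel_pool = channel_pool  # 'avg' (blue path) or 'max' (orange path)
        self.spatial_pool = spatial_pool  # 'max' (blue path) or 'avg' (orange path)
        # 1x1 convolutions refine the pooled channel descriptor (C -> C/r -> C);
        # the bottleneck ratio r is an assumed hyperparameter.
        self.channel_fc = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),
        )
        # Convolution over the single-channel spatial map (kernel size assumed).
        self.spatial_conv = nn.Conv2d(1, 1, kernel_size=7, stride=1, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: C x H x W -> C x 1 x 1 weights.
        if self.channel_pool == 'avg':
            desc = x.mean(dim=(2, 3), keepdim=True)
        else:
            desc = x.amax(dim=(2, 3), keepdim=True)
        x = x * torch.sigmoid(self.channel_fc(desc))
        # Spatial attention: C x H x W -> 1 x H x W weights.
        if self.spatial_pool == 'max':
            smap = x.amax(dim=1, keepdim=True)
        else:
            smap = x.mean(dim=1, keepdim=True)
        return x * torch.sigmoid(self.spatial_conv(smap))

class LCSA(nn.Module):
    """Blue (avg -> max) and orange (max -> avg) pathways, fused by summation."""
    def __init__(self, channels: int):
        super().__init__()
        self.blue = LCSAPathway(channels, channel_pool='avg', spatial_pool='max')
        self.orange = LCSAPathway(channels, channel_pool='max', spatial_pool='avg')

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.blue(x) + self.orange(x)
```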
2.2. LSFA Module
The detailed architecture of the LSFA module is illustrated in Figure 2. The LSFA module operates through two sequential stages: multi-scale feature extraction and dynamic feature fusion. In the first stage, the input features are processed in parallel by two distinct convolutional branches to capture complementary contextual information; one branch employs a convolutional kernel with a smaller receptive field, while the other utilizes a larger kernel. The resulting feature maps from both branches are then concatenated along the channel dimension, forming an integrated multi-scale feature representation.
The subsequent stage is designed to adaptively recalibrate and integrate the contributions from these multi-scale features. Initially, the feature maps from all branches are element-wise summed, and global average pooling is applied to compress the spatial information into a channel-wise descriptor. This compact descriptor is processed by two parallel convolutional layers, which generate a set of adaptive weighting coefficients corresponding to each scale; a SoftMax normalization over the branches ensures a probabilistic weighting distribution. Finally, the original multi-scale features are aggregated via a weighted summation based on these normalized coefficients. This mechanism enables the network to dynamically emphasize the most informative feature scales, thereby refining the final feature representation for enhanced discriminative capability.
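Below is a minimal PyTorch sketch of the LSFA split/fuse/select pattern described above. The branch kernel sizes (3 × 3 and 5 × 5) and the squeeze ratio (r = 4) are assumptions for illustration, since this section does not specify them.

```python
import torch
import torch.nn as nn

class LSFA(nn.Module):
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        # Stage 1: two parallel branches with different receptive fields
        # (kernel sizes are assumed, not taken from the paper).
        self.branch_a = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.branch_b = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        # Stage 2: squeeze the summed features, then one 1x1 conv per branch
        # produces that branch's per-channel weighting coefficients.
        self.squeeze = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        self.score_a = nn.Conv2d(channels // r, channels, kernel_size=1)
        self.score_b = nn.Conv2d(channels // r, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fa, fb = self.branch_a(x), self.branch_b(x)
        # Element-wise sum, then global average pooling to a channel descriptor.
        desc = self.squeeze((fa + fb).mean(dim=(2, 3), keepdim=True))
        # SoftMax across the two branches gives a probabilistic weighting.
        scores = torch.stack([self.score_a(desc), self.score_b(desc)], dim=0)
        wa, wb = torch.softmax(scores, dim=0)
        # Weighted summation of the multi-scale features.
        return fa * wa + fb * wb
```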
2.3. HMA-UNet Structure
A key strength of the UNet architecture lies in its use of skip connections between the encoder and decoder paths, which helps mitigate the loss of fine-grained detail during downsampling and feature propagation. However, conventional skip connections treat all incoming shallow features equally, without distinguishing between semantically relevant information and potentially disruptive elements. This undifferentiated fusion can limit segmentation precision and introduce artifacts in the final output. Inspired by the human visual system, which selectively attends to salient stimuli while suppressing irrelevant context, attention mechanisms provide an effective means to enhance feature selectivity. By dynamically weighting feature contributions, such mechanisms improve the efficiency of information extraction and focus computational resources on task-relevant regions. To address the limitations of standard skip connections, this study introduces learned attention weighting into the feature aggregation pathway. Specifically, the stacked convolutional layers conventionally used in the backbone of UNet are replaced with the proposed LCSA and LSFA modules. These modules adaptively assign importance weights to different components of the shallow feature maps, so that the features most relevant to the target segmentation task are conveyed through the skip connections. This refinement enables the network to concentrate on semantically meaningful patterns, leading to improved segmentation quality and robustness against irrelevant interference.
HMA-UNet is an end-to-end deep convolutional neural network built upon the UNet architecture. The model comprises two primary sub-networks: a downsampling encoder for hierarchical feature extraction from input RGB images, and a corresponding upsampling decoder dedicated to progressively reconstructing the feature maps and restoring spatial resolution. The final output is a segmentation map that matches the resolution of the initial input image. The proposed LCSA and LSFA attention modules are integrated into both sub-networks via residual connections. For the LCSA mechanism, a channel-spatial attention residual module is implemented (indicated by the blue block within the architecture diagram); correspondingly, the LSFA mechanism is realized through a split-fusion attention residual module (represented by the yellow block). Skip connections are established between corresponding layers of the encoder and decoder to preserve fine-grained spatial information. A final convolutional layer with a 1 × 1 kernel is applied at the network output to generate the pixel-wise segmentation result. The complete architecture of HMA-UNet is illustrated in Figure 3.
In the encoder stage, RGB remote sensing images are taken as input. Initially, each image is processed by a two-layer convolutional block (represented by the green block), which comprises two consecutive convolutional layers with a stride of 1, each followed by batch normalization and a ReLU activation function. The resulting feature map is then processed along two paths: one path applies a max pooling operation, halving the spatial dimensions, while the other forwards the feature map directly to the corresponding decoder layer via a skip connection. Subsequently, the pooled features are passed sequentially through the LCSA and LSFA residual blocks, with their outputs also being routed to the corresponding upsampling stage in the decoder. This sequence of convolution and pooling is repeated four times throughout the encoder. As a result, the number of feature channels progressively expands from an initial count of 3 to 512, while the spatial resolution is reduced to one-sixteenth of that of the initial input. At the innermost stage of the network, the channel count is further increased to 1024, enabling the capture of higher-level semantic features essential for effective model training and yielding more precise segmentation outcomes.
In the decoder stage of HMA-UNet, the 512-channel feature maps are first upsampled, which reduces the channel dimension to 256 while doubling the spatial size. This step ensures dimensional alignment with the corresponding encoder layer. The upsampled features are then concatenated with the output of the encoder’s LSFA residual module via a skip connection, and the concatenated result is processed by the LCSA residual module within the decoder. Following this, the output undergoes another upsampling operation, further reducing the channels to 128 and again doubling the spatial resolution. This sequence of upsampling and attention-based skip connection fusion is repeated four times throughout the decoder. As a result, the feature maps are progressively restored to the original input spatial dimensions, while the channel count is gradually reduced to 32. Finally, the features are passed through a 1 × 1 convolutional layer, which produces an output whose channel number equals the total number of segmentation categories.
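As an illustration of how one encoder stage could be assembled, the following condensed PyTorch sketch wires the green convolutional block together with the LCSA and LSFA classes from the sketches above, applied in residual form. The 3 × 3 kernel size of the convolutional block and the exact ordering are assumptions based on the description in this section.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two stride-1 conv layers, each with BN and ReLU (the green block);
    the 3x3 kernel size is an assumption."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

class EncoderStage(nn.Module):
    """Pool, convolve, then refine with LCSA and LSFA as residual blocks."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.pool = nn.MaxPool2d(2)  # halves the spatial dimensions
        self.conv = ConvBlock(c_in, c_out)
        self.lcsa = LCSA(c_out)      # from the LCSA sketch above
        self.lsfa = LSFA(c_out)      # from the LSFA sketch above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(self.pool(x))
        x = x + self.lcsa(x)         # channel-spatial attention residual
        x = x + self.lsfa(x)         # split-fusion attention residual
        return x                     # also routed to the decoder via skip connection
```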
The HMA-UNet employs a weighted combination of Dice loss and cross-entropy loss to train the network, as shown in Equation (1), where the weighting coefficient $\lambda$ is set to 0.6:

$$ L_{\mathrm{total}} = \lambda L_{\mathrm{Dice}} + (1 - \lambda) L_{\mathrm{CE}}. \quad (1) $$
The Dice loss function is an important loss function in the field of image segmentation, as it directly optimizes an objective related to the evaluation metrics, thereby alleviating class imbalance issues. In this paper, the multi-class Dice loss is expressed as Equation (2):

$$ L_{\mathrm{Dice}} = 1 - \frac{1}{C} \sum_{c=1}^{C} \frac{2 \sum_{i=1}^{N} p_{i,c}\, g_{i,c} + \varepsilon}{\sum_{i=1}^{N} p_{i,c} + \sum_{i=1}^{N} g_{i,c} + \varepsilon}, \quad (2) $$

where $C$ denotes the number of classes, $N$ represents the total number of pixels, $p_{i,c}$ denotes the predicted probability that pixel $i$ belongs to class $c$, and $g_{i,c}$ indicates whether the true label of pixel $i$ is category $c$ (1 if true, 0 otherwise). $\varepsilon$ is a smoothing term introduced to reduce overfitting, thereby stabilizing the training process.
The cross-entropy loss is given by Equation (3):

$$ L_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} g_{i,c} \log p_{i,c}, \quad (3) $$

where $N$ denotes the total number of pixels in the image, $C$ is the number of classes, $g_{i,c}$ is an indicator function that equals 1 if the ground-truth class of pixel $i$ is $c$ and 0 otherwise, and $p_{i,c}$ represents the probability predicted by the model that pixel $i$ belongs to class $c$. We combine the Dice loss with the cross-entropy loss to form a composite loss function, leveraging the strengths of both to achieve a more robust and accurate segmentation model.
4. Results and Analysis
This section presents the experimental results. The evaluation is designed to assess the performance of the proposed HMA-UNet and its core LCSA and LSFA modules through four primary analyses: (1) a comparative experiment pitting HMA-UNet against several typical segmentation models, including UNet [21], SegNet [23], HRNet [40], ViT [41], DeepLabv3+ [22], and U-Net++ [42]; (2) an ablation study examining the contribution of the LCSA and LSFA modules by comparing them with alternative attention mechanisms; (3) an inference speed analysis of the above models; and (4) a comparison with the transformer-based geospatial vision foundation model CrossEarth [43].
4.1. WHDLD Dataset
4.1.1. Comparative Benchmark Experiment
To assess the effectiveness of the proposed HMA-UNet, comprehensive comparative experiments were performed on the WHDLD dataset. Six state-of-the-art deep learning models (UNet, SegNet, HRNet, ViT, DeepLabv3+, and U-Net++) were implemented under identical experimental conditions for a fair comparison. All networks, including HMA-UNet, were initialized randomly without any transfer learning, and all architectures demonstrated stable convergence during optimization. The quantitative results, including mAccuracy, mIoU, mPrecision, and mRecall, are summarized in Table 1, while qualitative visual comparisons are provided in Figure 5. The experimental outcomes demonstrate that HMA-UNet achieves superior performance across the evaluated metrics compared to the existing benchmark models.
As shown in Table 1, the proposed HMA-UNet outperforms all benchmark methods across all evaluated metrics. Specifically, HMA-UNet achieves an mAccuracy of 72.40%, an mIoU of 60.71%, an mPrecision of 75.76%, and an mRecall of 72.41% on the WHDLD dataset. This superior performance validates the effectiveness of the enhanced architecture in refining feature screening and integration: the original feature information from the encoder, delivered via skip connections, is retained and simultaneously supplemented by multi-dimensional features further harnessed from the shallow layers. This design enhances the delineation of edge details in segmentation tasks and reduces both false positives and missed detections for the large objects commonly found in remote sensing imagery. In comparison, methods such as HRNet (which employs repeated cross-parallel convolutions for multi-scale fusion) and SegNet (which utilizes pooling indices) yield notably lower scores across all four metrics. The performance gap is particularly evident for the baseline UNet, which trails HMA-UNet by margins of 23.09% in mAccuracy, 26.00% in mIoU, 13.69% in mPrecision, and 23.09% in mRecall. These outcomes validate the effectiveness of the proposed structural enhancements built on the foundational UNet architecture, which significantly boost both the accuracy and robustness of segmentation in remote sensing image analysis.
Figure 5 provides a visual comparison of prediction results across different methods, with a focus on several representative scenarios: uniform water and vegetation, and diverse urban morphologies, including areas with dispersed and dense building layouts. This qualitative examination illustrates the segmentation capability of HMA-UNet, whose advantages are reflected in its ability to preserve the edge contours of water, vegetation, buildings, and roads while simultaneously reducing both misclassification and omission errors. The figure comprises five sets of prediction results, each arranged in eight columns: the original image is presented in Figure 5a, the ground truth in Figure 5b, and Figure 5c–h showcase the prediction results of SegNet, HRNet, ViT, DeepLabv3+, U-Net++, and HMA-UNet, respectively. Through comparative visual assessment, it can be observed that the proposed method achieves more accurate segmentation of buildings, roads, vegetation, bare soil, and water bodies, delivering detailed and precise predictions across diverse scenarios.
As shown in the red box of the first image in Figure 5, the predictions from SegNet, DeepLabv3+, and U-Net++ manifest noticeable discontinuities and misclassified points. In the first image, although SegNet, HRNet, and DeepLabv3+ can recognize the existence of water, their segmentation results are not as complete as those of our proposed method. In the second image, owing to the dense distribution of buildings, the predictions of SegNet, DeepLabv3+, and U-Net++ suffer from significant inaccuracies and visual ambiguity. In contrast, our proposed HMA-UNet, enhanced by the integration of the LCSA and LSFA mechanisms, effectively mitigates background interference and emphasizes critical features. This leads to superior robustness, as evidenced by the third image containing extensive water bodies within vegetated areas: while HMA-UNet accurately delineates the vegetation, the other comparative models produce evident errors, as indicated within the pink annotations. Within the blue annotations of the fourth image, SegNet and DeepLabv3+ misclassify some building areas as roads, whereas HMA-UNet performs well by accurately segmenting the buildings. Moreover, HMA-UNet accurately identifies the small water area within the white box, which the other methods fail to detect.
In the fifth image, the proposed HMA-UNet more accurately identifies the road targets enclosed by water areas, with clearer segmentation edges and smoother results, while the results of SegNet, HRNet, DeepLabv3+, and U-Net++ reveal obvious misclassification. Overall, the segmentation results produced by the proposed method demonstrate greater structural integrity and contour completeness. In contrast, HRNet and ViT exhibit noticeable misclassification errors, while the outcomes from SegNet, DeepLabv3+, and U-Net++ display incomplete segmentation of buildings, roads, and vegetation. The proposed HMA-UNet proves particularly effective in complex scenes, accurately delineating target objects while suppressing irrelevant interference, which underscores its comparative advantage over the other methods examined.
4.1.2. Ablation Experiment for the Attention Module
To further demonstrate the advantages of the proposed LCSA and LSFA modules, a comparative study is conducted with several widely used attention mechanisms, such as spatial attention (SA) [44], Nonlocal [30], the bottleneck attention module (BAM), squeeze-and-excitation (SE), the convolutional block attention module (CBAM), and selective kernel (SK), when integrated into the UNet architecture. This comparison aims to evaluate the contribution of different attention mechanisms to enhancing the network’s segmentation capability. All networks are trained from scratch under identical experimental conditions, with consistent hyperparameters and without employing any pre-trained models. As summarized in Table 2, the proposed LCSA and LSFA modules achieve superior quantitative results compared to the existing attention mechanisms. For a direct visual appraisal, the predicted results from the various methods are juxtaposed in Figure 6.
Table 2 indicates that the SA and Nonlocal mechanisms are not well-suited for this segmentation task within the UNet framework, as evidenced by the suboptimal performance of the corresponding models on the evaluation metrics. In contrast, the models incorporating the BAM, SE, and CBAM modules demonstrate comparable performance levels to one another and perform much better than the baseline UNet, UNet + SA, and UNet + Nonlocal models. Among them, the UNet + SK model achieves the best results, outperforming the UNet + BAM, UNet + SE, and UNet + CBAM methods across all evaluated metrics. Specifically, UNet + SK achieves an mAccuracy of 70.26%, an mIoU of 62.04%, an mPrecision of 74.26%, and an mRecall of 70.94% on the WHDLD dataset. Nevertheless, the UNet + SK model still performs slightly worse than our HMA-UNet on most metrics, trailing it by margins of 2.96% in mAccuracy, 1.98% in mPrecision, and 2.03% in mRecall. The effectiveness of our LCSA and LSFA modules is further evidenced by these comparative experimental results, confirming their positive impact on this segmentation task.
Figure 6 presents five sets of prediction results, each arranged in ten columns. Figure 6a displays the original image, Figure 6b the ground truth, and Figure 6c–j the prediction results of UNet, UNet + SA, UNet + Nonlocal, UNet + BAM, UNet + SE, UNet + CBAM, UNet + SK, and HMA-UNet, respectively.
In the first image, UNet + BAM, UNet + CBAM, UNet + SK, and HMA-UNet achieve comparable accuracy in vegetation segmentation; however, HMA-UNet better preserves the structural continuity of vegetation areas. The integration of multi-attention mechanisms consistently endows these four models with significantly enhanced robustness against noise interference in such contexts. In the second image, which includes sparse bare soil and road interference, UNet + BAM, UNet + CBAM, and UNet + SK demonstrate superior accuracy, yet fail to segment the bare soil regions completely. The third image, featuring dense and uniformly distributed buildings, reveals notably poor segmentation performance by UNet, UNet + SA, and UNet + SE. Although the other methods detect most of the building targets, their boundaries are often blurred, and bare soil between buildings is frequently misclassified as building area. For the fourth image, with sparsely distributed buildings, the proposed HMA-UNet delivers clearly superior segmentation results, whereas UNet, UNet + SA, and UNet + BAM produce fragmented and less coherent building segments. On road segmentation in this scene, UNet + SK and HMA-UNet perform notably better than the other compared methods. In the fifth image, UNet + SE yields the weakest road segmentation (indicated in the pink box), whereas the other models perform adequately in this aspect. Additionally, UNet, UNet + SA, and UNet + Nonlocal fail to identify the bare soil region highlighted in the white box.
Overall, the HMA-UNet achieves consistently accurate segmentation across varied scenarios, including sparse or dense target distributions, unclear boundaries, and low inter-class spectral contrast. Its ability to selectively enhance relevant features plays a crucial role in obtaining robust segmentation results, especially under complex environmental interference.
Upon evaluating the aforementioned attention mechanisms, the CBAM and SK modules demonstrate notable effectiveness in remote sensing image segmentation tasks, particularly in handling large intra-class scale variations, complex background interference, and subtle inter-class differences. These two modules outperform the other candidates, such as SE, BAM, Nonlocal, and SA, in terms of both quantitative metrics and qualitative visual fidelity. We therefore integrate the CBAM and SK modules into the same architectural framework as the proposed LCSA and LSFA modules, constructing a model denoted as UNet + CBAM + SK. The segmentation results of this model are compared with those of HMA-UNet to further validate the efficacy of the proposed modules. The quantitative results, including mAccuracy, mIoU, mPrecision, and mRecall, are summarized in Table 3, while qualitative visual comparisons are provided in Figure 7.
Table 3 shows that the proposed HMA-UNet (UNet + LCSA + LSFA) performs better than UNet + CBAM + SK on most of the metrics, including mAccuracy, mPrecision, and mRecall, but slightly worse on mIoU. Specifically, the UNet + CBAM + SK model achieves an mAccuracy of 71.60%, an mIoU of 63.17%, an mPrecision of 75.46%, and an mRecall of 71.60% on the WHDLD dataset, trailing HMA-UNet by margins of 1.10% in mAccuracy, 0.40% in mPrecision, and 1.12% in mRecall. These comparison results further confirm the effectiveness of the proposed HMA-UNet.
The comparative results in Figure 7 are organized into seven groups of four rows each. From top to bottom, the rows correspond to the input image (Figure 7a), the annotation mask (Figure 7b), the prediction of UNet + CBAM + SK (Figure 7c), and the prediction of HMA-UNet (Figure 7d). In the pink box of the first image, the results of the UNet + CBAM + SK model show obvious breakpoints and discontinuities. In the second and third images, the road and bare soil extraction results of HMA-UNet again surpass those of UNet + CBAM + SK, being more continuous, smooth, and complete. In the pink box of the fourth image, the UNet + CBAM + SK model fails to identify small bodies of water, while the HMA-UNet model identifies them effectively. In the fifth and sixth images, the UNet + CBAM + SK model misclassifies some water and bare soil as vegetation and road, respectively, while HMA-UNet performs better. In the seventh image, HMA-UNet recognizes road pixels more accurately than the UNet + CBAM + SK model. In general, the segmentation results obtained with the LCSA + LSFA modules (HMA-UNet) are more accurate, capturing more correct semantic information across categories; the predicted regions appear more complete, with better-defined boundaries and fewer instances of misclassification or fragmentation. These qualitative improvements further validate the effectiveness of the proposed method.
4.2. DeepGlobe Road Extraction Dataset
The transferability of HMA-UNet was assessed by training all models on the DeepGlobe Road Extraction Dataset. All model architectures were kept identical to those in the previous experiments; only the training data differed. The corresponding quantitative results are provided in Table 4, and representative visual comparisons are shown in Figure 8.
The performance of the proposed HMA-UNet on the DeepGlobe Road Extraction Dataset is quantitatively compared with that of the other methods across multiple metrics in Table 4. HMA-UNet achieves an accuracy of 57.87%, an IoU of 49.82%, a precision of 78.18%, and a recall of 57.87%. Compared to the original UNet, HMA-UNet demonstrates substantial improvements across most metrics. Specifically, UNet yields an accuracy of 15.21%, an IoU of 14.86%, a precision of 86.42%, and a recall of 15.21%, meaning that HMA-UNet surpasses UNet by significant margins of 73.72% in accuracy, 70.17% in IoU, and 73.72% in recall. It is noted that UNet attains the highest precision of 86.42%, which is 9.53% higher than that of HMA-UNet. Other methods with relatively low overall accuracy, such as UNet + SA, UNet + Nonlocal, and SegNet, also outperform HMA-UNet in terms of precision. This can be attributed to an inherent trade-off in binary classification tasks: a model that labels road pixels conservatively, and thus achieves low recall, tends to attain high precision on the few pixels it does assign to the road class. Overall, both the UNet + CBAM + SK model and the proposed HMA-UNet deliver superior comprehensive performance compared to the other methods, and HMA-UNet slightly outperforms the UNet + CBAM + SK model across the evaluated metrics. These results verify that attention mechanisms provide quantifiable improvements in this segmentation task, with the proposed LCSA and LSFA modules exhibiting the most favorable performance gain.
Figure 8 illustrates two test images (Image 1 and Image 2), each subdivided into a 2 × 7 grid comprising 14 sub-figures. Labels (a) through (n) correspond to the original image, the ground truth, and the segmentation results of UNet, UNet + SA, UNet + Nonlocal, UNet + BAM, UNet + SE, UNet + CBAM, UNet + SK, SegNet, DeepLabv3+, U-Net++, UNet + CBAM + SK, and HMA-UNet, respectively.
Image 1 reveals a common limitation across all compared methods, where their segmentation results exhibit pronounced artifacts due to the interference from buildings, vegetation, vehicles, and bare soil. Specifically, certain road sections are misclassified as background. While most networks accurately identify the road in the lower-left region, notable errors persist in segmenting low-contrast road segments, which are incorrectly assigned to the background. The segmentation performance of UNet + CBAM + SK and HMA-UNet is comparable. However, HMA-UNet exhibits fewer misclassifications in the upper-right area.
In Image 2, the road segmentation performance is evaluated under conditions of extensive and uniformly distributed building coverage. Most networks fail to completely segment roads adjacent to the uniformly distributed buildings, resulting in high error rates. Although UNet + BAM and UNet + SE successfully segment a majority of the road areas, partial omissions remain. UNet + SK and UNet + CBAM + SK demonstrate overall superior segmentation performance, accurately extracting most road sections; however, they are less effective than HMA-UNet in preserving fine details. On the DeepGlobe Road Extraction Dataset, the incorporation of the LCSA and LSFA modules into UNet enhances detail retention: the proposed modules improve focus on foreground road features while suppressing background interference and reducing the impact of noise.
4.3. Inference Speed Discussion
Figure 9 compares the inference speed of each model under identical conditions, using frames per second (FPS) as the metric. On the WHDLD and DeepGlobe datasets, HMA-UNet exhibits a 42.45% and 40.46% increase in inference time compared to the original UNet, respectively. Nevertheless, it maintains a processing speed of over 260 FPS, which satisfies the requirements of real-time applications. Given the significant improvement in segmentation performance achieved by HMA-UNet on remote sensing imagery, this additional computational overhead is considered acceptable. Furthermore, compared to the UNet + CBAM + SK model, HMA-UNet infers slightly faster while delivering superior performance on the mAccuracy, mPrecision, and mRecall metrics. Overall, HMA-UNet demonstrates robust segmentation capability for urban remote sensing images, with the gain in performance well justified by a reasonable increase in inference cost. These results confirm that the proposed LCSA and LSFA modules provide an effective and practical means of enhancing network segmentation performance.
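For reference, the following is a minimal sketch of how per-model FPS could be measured under identical conditions (batch size 1, warm-up, synchronized GPU timing). The 256 × 256 input resolution and iteration counts are assumptions for illustration, not the paper's reported protocol.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, iters: int = 200,
                shape=(1, 3, 256, 256), device: str = "cuda") -> float:
    """Return frames per second for a single-image forward pass."""
    model = model.to(device).eval()
    x = torch.randn(*shape, device=device)
    for _ in range(20):              # warm-up to stabilize clocks and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()     # ensure warm-up work has finished
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()     # wait for all queued GPU work
    return iters / (time.perf_counter() - start)
```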
4.4. Comparison with CrossEarth
The quantitative comparison between the proposed HMA-UNet and the geospatial vision foundation model CrossEarth is summarized in Table 5. On the WHDLD dataset, HMA-UNet achieves superior performance in terms of both IoU and precision: it attains an IoU of 60.71% and a precision of 75.76%, marginally surpassing CrossEarth by 0.41% and 0.64%, respectively. On the DeepGlobe dataset, CrossEarth demonstrates superior segmentation accuracy, achieving an accuracy of 67.77%, an IoU of 78.85%, a precision of 82.25%, and a recall of 78.85%. In contrast, HMA-UNet attains an accuracy of 57.87%, an IoU of 49.82%, a precision of 78.18%, and a recall of 57.87%. This performance gap can be attributed to the vastly greater model capacity of CrossEarth, which possesses 327.77 million parameters, enabling it to learn richer and more complex feature representations from large-scale pre-training.
However, this high accuracy comes at a significant computational cost. CrossEarth has a model size of approximately 1.25 GB and an inference speed of only 60–70 FPS on standard hardware. Such requirements hinder its deployment in resource-constrained scenarios, particularly for on-board satellite processing, where power, memory, and computational budgets are strictly limited. In stark contrast, the proposed HMA-UNet exemplifies a lightweight design: with merely 15.43 million parameters (about one-twentieth of CrossEarth’s) and a compact model footprint of 58.87 MB, it achieves a remarkably high inference speed of 260–270 FPS. This efficiency makes it a compelling candidate for real-time or on-board Earth observation applications. Notably, on the WHDLD dataset, HMA-UNet achieves highly competitive performance relative to CrossEarth: despite its drastically smaller size, it even slightly surpasses the foundation model in terms of IoU and precision, while matching it closely in accuracy and recall.
This comparison highlights a critical trade-off in remote sensing segmentation between pure accuracy and practical deployability. While large transformer-based vision foundation models like CrossEarth set the state of the art in accuracy, their computational footprint is prohibitive for edge deployment. The proposed HMA-UNet offers a balanced alternative, delivering comparable, and on some datasets superior, precision at a fraction of the computational cost and model size. This design aligns with the growing trend towards lightweight, efficient algorithms for geospatial intelligence, demonstrating that a carefully designed compact model can strike a favorable balance between performance and practicality for specific operational contexts.