Article

Remote Sensing Image Segmentation Network That Integrates Global–Local Multi-Scale Information with Deep and Shallow Features

1 School of Landscape Architecture, Southwest Forestry University, Kunming 650224, China
2 School of Geography, Yunnan Normal University, Kunming 650224, China
3 College of Big Data and Intelligence Engineering, Southwest Forestry University, Kunming 650224, China
4 Art and Design College Engineering, Southwest Forestry University, Kunming 650224, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(11), 1880; https://doi.org/10.3390/rs17111880
Submission received: 28 April 2025 / Revised: 24 May 2025 / Accepted: 26 May 2025 / Published: 28 May 2025

Abstract

As the spatial resolution of remote sensing images continues to increase, the complexity of the information they carry also grows. Remote sensing images are characterized by large imaging areas, scattered distributions of similar objects, intricate boundary shapes, and a high density of small objects, all of which pose significant challenges for semantic segmentation tasks. To address these challenges, we propose a Remote Sensing Image Segmentation Network that Integrates Global–Local Multi-Scale Information with Deep and Shallow Features (GLDSFNet). To better handle the wide variations in object sizes and complex boundary shapes, we design a Global–Local Multi-Scale Feature Fusion Module (GLMFFM) that enhances segmentation performance by fully leveraging multi-scale information and global context. Additionally, to improve the segmentation of small objects, we propose a Shallow–Deep Feature Fusion Module (SDFFM), which effectively integrates deep semantic information with shallow spatial features through mutual guidance, retaining the advantages of both. Extensive ablation and comparative experiments conducted on two public remote sensing datasets, ISPRS Vaihingen and Potsdam, demonstrate that our proposed GLDSFNet outperforms state-of-the-art methods.

1. Introduction

Semantic segmentation plays a critical role in the processing of remote sensing (RS) imagery [1], as it involves assigning semantic category labels to individual pixels within an image. Recent advancements in artificial intelligence (AI) have substantially propelled geoscience research, with deep learning methods widely adopted for the analysis of RS data [2,3].
In particular, deep convolutional neural networks (CNNs) have become dominant in semantic segmentation tasks due to their powerful feature representation capabilities [4,5]. A landmark contribution by Long et al. [6] replaced traditional fully connected layers with fully convolutional layers, enabling dense, pixel-level predictions for images of arbitrary sizes. However, the resulting segmentation maps often suffer from coarse boundaries due to the loss of spatial location information. To address this, a variety of Fully Convolutional Network (FCN)-based architectures have been proposed. Representative models include U-Net [7], SegNet [8], and E-Net [9], all of which have demonstrated promising performance across a range of applications. U-Net, in particular, utilizes skip connections to merge features from different levels of the network, thereby mitigating information loss between the encoder and decoder stages. Nevertheless, skip connections typically link features from an encoder layer directly to a decoder layer with the same spatial resolution. This approach only allows information transfer at the same scale and fails to effectively integrate semantic information across different scales.
Motivated by this limitation, subsequent research has emphasized the importance of multi-scale contextual information. PSPNet [10] incorporates multiple parallel pyramid pooling layers to capture context at different scales. Similarly, DeepLabV3+ [11] introduces Atrous Convolution and the Atrous Spatial Pyramid Pooling (ASPP) module, enabling the network to extract features across various receptive fields and capture multi-scale context simultaneously. DenseASPP [12] enhances the model’s multi-scale perception ability by densely connecting atrous convolution layers with varying dilation rates, while avoiding an increase in model complexity. More recently, DMNet [13] proposes a dynamic convolution module equipped with adaptive multi-scale filters, enabling the network to adjust its receptive field according to the content of the input image, thereby improving segmentation performance in diverse scenarios.
Nevertheless, applying these models to high-resolution remote sensing (RS) imagery poses unique challenges. As illustrated in Figure 1a, significant variations exist in spectral features, object sizes, textures, and other aspects at both inter-class and intra-class levels. These discrepancies demand models with robust generalization capabilities. Existing methods often struggle with the distinct challenges inherent in RS images, which are characterized by pronounced scale variations and complex object boundaries. Unlike natural scene images, conventional multi-scale feature extraction approaches often fail to effectively capture the nuanced and hierarchical scale variations inherent in remote sensing imagery [14,15,16].
Additionally, as shown in Figure 1b, deep learning algorithms still face significant challenges in accurately detecting small-scale objects in remote sensing images [17,18]. The features of such objects are often blurred and indistinct, making them difficult to differentiate and increasing the model’s sensitivity to noise during training, which in turn hampers recognition accuracy. Moreover, the limited receptive field of convolutional neural networks (CNNs), coupled with the scarcity of annotated samples for small objects, further constrains the model’s generalization capability [19]. Although CNNs have demonstrated success in many image processing tasks, their inherent limitation in modeling long-range dependencies restricts their ability to capture global semantic context. Accurately understanding pixel-level semantics requires both global context and local detail, motivating researchers to explore transformer-based approaches [20,21,22]. Zheng et al. [23] introduced the Segmentation Transformer (SETR), which uses a transformer architecture as the backbone and marks a significant advancement in segmentation tasks. However, the multi-head self-attention mechanism employed in transformers leads to considerable computational overhead; for example, the self-attention operation scales quadratically with the input size, resulting in FLOPs and memory consumption that can be up to 2–3 times higher than those of typical CNN-based models for high-resolution images [24]. This is particularly problematic for remote sensing applications, where images are often very large (e.g., thousands of pixels per side), requiring efficient processing to enable practical deployment. To address this, Liu et al. proposed the Swin Transformer [24], which leverages a shifted window-based attention mechanism to reduce the computational complexity from quadratic to linear with respect to image size, thereby significantly improving efficiency. Since then, numerous researchers have incorporated attention mechanisms or transformer modules to enhance segmentation accuracy by capturing richer global context information. These developments underscore the fact that both global context and local feature representations are crucial for semantic segmentation in RS images. Effectively integrating these two aspects can lead to more precise pixel-wise classification.
Inspired by the above literature, this paper proposes a Remote Sensing Image Segmentation Network that Integrates Global–Local Multi-Scale Information with Deep and Shallow Features (GLDSFNet). GLDSFNet contains two main modules. The first is the Global–Local Multi-Scale Feature Fusion Module (GLMFFM), which comprises a Global Context Feature Extraction Module (GCFEM) and a Multi-Scale Deformable Convolution Module (MDCM) and is used to effectively extract and fuse global and multi-scale context information. GCFEM captures the global information of the image through an attention mechanism, effectively establishing long-range dependencies. MDCM further expands the receptive field by employing deformable convolutions of multiple sizes. Unlike standard convolutions with fixed grid sampling locations, deformable convolutions adaptively adjust their sampling positions according to the shape of the objects in the image. This flexibility allows the network to align better with the irregular and complex boundaries of remote sensing objects, thereby capturing more detailed and accurate boundary information [25]; such shape-adaptive convolution has been demonstrated to be effective in previous works. The second is the Shallow–Deep Feature Fusion Module (SDFFM), which fuses deep and shallow information to capture intricate spatial details and mitigate the loss of boundaries for small-scale objects. Specifically, SDFFM uses the abstract semantic information of the deep feature map to accurately guide the selection of shallow spatial features and, at the same time, uses the precise spatial details of the shallow feature map to guide the selection of deep semantic features. The main contributions can be summarized as follows.
  • We propose a novel semantic segmentation network, GLDSFNet, which improves the recognition of objects at multiple scales by effectively integrating global context, multi-scale local features, shallow details, and deep semantics within the decoder.
  • We design the GLMFFM, composed of GCFEM and MDCM, to simultaneously extract and fuse global contextual information and multi-scale local features, enhancing the overall feature representation capacity.
  • We introduce the SDFFM, which enables mutual guidance between deep and shallow features. By preserving and fusing their complementary characteristics, SDFFM enhances the expressiveness of feature maps and boosts segmentation accuracy, especially for fine-grained object boundaries.

2. Related Work

2.1. CNN-Based Semantic Segmentation Methods

The objective of semantic segmentation in remote sensing imagery is to classify each pixel into its corresponding semantic category. As a fundamental task in computer vision, semantic segmentation lays the groundwork for the understanding and interpretation of remote sensing data [26]. It enables a wide range of applications, including land use classification [27], environmental monitoring [28], urban planning [29,30], and autonomous driving [31]. Compared to traditional pixel-based classification methods, CNNs demonstrate superior performance in semantic segmentation tasks due to their powerful feature learning capabilities. FCNs were the first to enable semantic segmentation on images of arbitrary sizes, significantly advancing the development of CNN-based methods. However, FCNs suffer from limitations, such as the loss of spatial information, insufficient fine-grained semantics, and high computational costs.
Subsequently, U-Net introduced an encoder–decoder architecture with skip connections, which helps preserve spatial details and improves semantic refinement, while also optimizing computational efficiency. Nonetheless, the skip connections in U-Net primarily fuse features at corresponding levels, lacking the capacity to effectively extract and integrate multi-scale features. This limitation is particularly critical in remote sensing (RS), where objects, such as buildings, roads, rivers, and vegetation, exhibit drastic variations in size, shape, and spatial distribution. For instance, a single RS image may contain both small vehicles and large buildings, which require different receptive fields for accurate recognition. Therefore, effectively capturing and integrating multi-scale features is essential to address these scale variations and improve segmentation accuracy and robustness. To address this limitation, the DeepLab series introduced the Atrous Spatial Pyramid Pooling (ASPP) module, which extracts features at multiple scales in parallel and fuses them to enrich the semantic representation. Building on this, models such as A2-FPN [32], MAResU-Net [33], and MANet [34] have further enhanced multi-scale feature aggregation by incorporating linear and dot-product attention mechanisms. While attention-based fusion methods have improved the discriminative power of features, a critical issue remains: the inherent differences between shallow and deep features are often neglected. Shallow features tend to contain rich spatial and positional information, such as fine-grained textures and edges at the pixel level. However, deep features encode more high-level, category-specific semantic information—for example, land cover patterns or object classes—that provide broader contextual understanding. Effectively integrating these complementary properties remains a pressing challenge in remote sensing image segmentation. This difficulty is exacerbated in remote sensing due to the large variation in object scales: small objects, like vehicles, require detailed spatial features for accurate detection, while large land-cover categories depend more heavily on abstract semantic cues. Balancing these diverse demands calls for sophisticated multi-scale and multi-level feature fusion strategies [35,36].
To address these issues, we propose a Multi-scale Deformable Convolution Module (MDCM) that expands the receptive field by employing deformable convolutions with varying kernel sizes. These deformable convolutions dynamically adjust the sampling locations of convolution kernels by learning offset parameters, enabling flexible adaptation to the geometric variations of objects in the imagery. Additionally, we introduce the Shallow–Deep Feature Fusion Module (SDFFM) to enhance the integration of shallow and deep features. SDFFM effectively captures intricate spatial details and reduces boundary information loss, especially benefiting the segmentation of small-scale objects. Unlike conventional fusion methods that typically use simple concatenation or summation, SDFFM selectively enhances shallow features guided by deep semantic cues, while simultaneously refining deep semantic representations with the spatial precision of shallow features. This bidirectional enhancement enables the module to better preserve fine structural and boundary details of small and densely distributed objects.

2.2. Transformer-Based Semantic Segmentation Methods

In recent years, the rise of Transformers in natural language processing and computer vision has driven their widespread application in semantic segmentation tasks. The Vision Transformer (ViT) [37] revolutionized image classification by replacing the traditional convolutional neural network architecture with a pure Transformer framework. Building on this, the SETR model employs a Transformer encoder to capture global contextual information. However, remote sensing images often possess extremely high spatial resolution and multiple spectral bands—for example, multispectral satellite images with resolutions up to 10 cm. The massive data volume causes the self-attention mechanism to incur computational and memory costs that grow quadratically with the number of pixels, significantly increasing hardware resource demands during training and inference. This limits the practical application of such techniques in resource-constrained environments. To address this, the Swin Transformer introduces a hierarchical design with non-overlapping sliding windows and local self-attention mechanisms, effectively reducing computational complexity while maintaining strong representational capacity for large-scale remote sensing images.
CNNs and Transformers each offer distinct advantages in visual processing. CNNs are adept at capturing local structures and textures through convolutional operations within limited receptive fields, while Transformers excel at modeling long-range dependencies via self-attention, enabling a deeper understanding of global semantics. To combine the strengths of both, numerous hybrid frameworks have been developed. Existing hybrid models for remote sensing image segmentation can be broadly categorized by their design philosophy, such as window-based and stripe-based approaches. For example, DC-Swin [38] leverages densely connected feature aggregation modules to effectively fuse multi-scale global features within window-based self-attention. UNetFormer [39] introduces a Global–Local transformer block that builds global relationships between windows while preserving local details via window-aligned pooling. Meanwhile, models like CMTFNet [40] and MTFNet [41] apply adaptive pooling to reduce feature map sizes, enabling efficient multi-scale global context extraction with reduced computational costs. Further extending this line of work, CMLFormer [42] employs Multi-scale Local-context Transform Blocks that capture both local and global features across scales using horizontal and vertical stripe convolutions, enhancing cross-scale interactions with lower complexity. Despite these advances, many existing methods still struggle to fully address the unique scale hierarchies and complex spatial distributions found in remote sensing images. In particular, their adaptability to drastic scale variations and dense, irregular object arrangements typical of remote sensing scenes remains limited. Therefore, designing multi-scale feature fusion strategies that better capture fine-grained spatial details and high-level semantic context—while maintaining computational efficiency—is essential for further progress in remote sensing image segmentation.

3. Materials and Methods

3.1. Overall Structure

The overall architecture of GLDSFNet is shown in Figure 2 and follows a classic encoder–decoder structure. The encoder uses a pre-trained ResNet-50 [43] as the backbone for feature extraction. In the decoder, 1 × 1 convolutions are applied to ResBlock1 through ResBlock4 to unify their channel dimensions to 256, resulting in four feature maps: S1, S2, S3, and S4. Among these, the deepest feature map, S4, is input into the GLMFFM module, which extracts and integrates global and multi-scale contextual information to produce M1. Then, M1 is fused with the shallower feature map S3 through the SDFFM module. This step enhances the integration of deep semantic features with shallow spatial details. The fusion continues hierarchically through the decoder, ultimately producing a final feature map enriched with global, local, and multi-scale semantic context.
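To make the data flow concrete, the following PyTorch sketch wires up the decoder described above. It is a minimal illustration, not the authors' implementation: GLMFFM and SDFFM are represented by simple placeholders (identity and element-wise addition, respectively), the bilinear upsampling between stages and the final 4× upsampling of the logits are our assumptions, and all class and argument names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class AddFuse(nn.Module):
    """Placeholder for SDFFM (Section 3.3): plain addition of deep and shallow features."""
    def forward(self, deep, shallow):
        return deep + shallow

class GLDSFNetSketch(nn.Module):
    def __init__(self, glmffm=None, fuse_blocks=None, channels=256, num_classes=6):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.blocks = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        # 1x1 convolutions unify the four ResBlock outputs (256/512/1024/2048 ch) to 256 channels.
        self.lateral = nn.ModuleList([nn.Conv2d(c, channels, 1) for c in (256, 512, 1024, 2048)])
        self.glmffm = glmffm or nn.Identity()                      # GLMFFM placeholder (Section 3.2)
        self.fuse = fuse_blocks or nn.ModuleList([AddFuse() for _ in range(3)])  # SDFFM placeholders
        self.classifier = nn.Conv2d(channels, num_classes, 1)

    def forward(self, x):
        feats, f = [], self.stem(x)
        for block in self.blocks:
            f = block(f)
            feats.append(f)
        s1, s2, s3, s4 = [lat(f) for lat, f in zip(self.lateral, feats)]

        m = self.glmffm(s4)                                        # M1: global + multi-scale context
        for fuse, shallow in zip(self.fuse, (s3, s2, s1)):
            m = F.interpolate(m, size=shallow.shape[-2:], mode="bilinear", align_corners=False)
            m = fuse(m, shallow)                                   # deep/shallow mutual-guidance fusion
        logits = self.classifier(m)
        return F.interpolate(logits, scale_factor=4, mode="bilinear", align_corners=False)

# Smoke test: GLDSFNetSketch()(torch.randn(1, 3, 512, 512)).shape -> (1, 6, 512, 512)
```

Replacing the placeholders with the GLMFFM and SDFFM modules sketched in the following subsections yields the full hierarchical fusion described above.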

3.2. Global–Local Multi-Scale Feature Fusion Module

Compared with natural images, remote sensing images typically cover broader geographic areas, and the objects of interest within them may appear at varying scales and locations. Therefore, it is essential to consider both scale variations and spatial distributions when processing such images. Additionally, due to the characteristics of significant intra-class variability and subtle inter-class differences in remote sensing imagery, it becomes crucial to effectively model global features and establish long-range contextual dependencies. At the same time, capturing multi-scale local information is also necessary to achieve a comprehensive understanding and analysis of image content.
To address these challenges, we propose the GLMFFM, whose architecture is illustrated in Figure 3. Within GLMFFM, GCFEM and MDCM operate in parallel to extract global contextual features and multi-scale local details, respectively. Their outputs are then fused through element-wise addition, enabling direct integration of global and local representations. This produces a unified feature map that facilitates more accurate analysis and interpretation of remote sensing images.
Specifically, in the GLMFFM (Global–Local Multi-scale Feature Fusion Module), the processing of the input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$ begins with a channel expansion step. Here, B denotes the batch size, C the number of input channels, and H, W the spatial height and width, respectively. To increase the representational capacity, the number of channels is expanded by a factor of 3 using a 1 × 1 convolution. This operation is computationally efficient and preserves spatial resolution, while allowing the model to learn richer feature representations across the expanded channels. Next, the expanded feature map is partitioned into non-overlapping windows of size $ws \times ws$ through a window segmentation operation. This local windowing strategy limits self-attention computation to manageable spatial regions, significantly reducing memory consumption and computational cost compared to global self-attention, which is especially important for high-resolution remote sensing images. Each window is treated as an independent unit for self-attention. Within each window, three feature vectors are derived for each spatial location: the Query Q, Key K, and Value V. These are obtained through linear projections of the windowed features. The number of attention heads h is set to 8, enabling the model to attend to different subspaces of the feature representation simultaneously. Each attention head processes a feature subspace of dimension d = 64, chosen to balance the trade-off between expressiveness and computational efficiency.
Attention weights are computed by performing a dot product between the Query and Key vectors within each head to produce raw attention scores. These scores reflect the similarity or relevance between spatial positions within the window. The raw scores are then normalized via the softmax function to obtain a distribution of attention weights that sum to one, ensuring stable gradient flow and emphasizing the most relevant features. These normalized attention weights are then applied to the corresponding Value vectors through weighted summation, producing the attention output for each spatial position in the window. This process captures context-dependent feature interactions within the window, enhancing the representation of both local and contextual information. After computing attention outputs for all windows, a rearrangement (or “reverse window partitioning”) operation is performed to restore the spatial structure of the feature map to its original resolution H × W. This step reassembles the windowed outputs into a continuous feature map, enabling subsequent layers to process a unified spatial representation. Finally, a dimensionality reduction operation—typically implemented by a 1 × 1 convolution—is applied to reduce the expanded channel dimension back to a manageable size for downstream processing. This step controls model complexity and computational cost, ensuring that the fused features can be efficiently integrated into later network modules.
The hyperparameters (e.g., window size ws = 8, number of attention heads h = 8, and per-head dimension d = 64) are selected based on a balance between capturing sufficient spatial context and maintaining computational feasibility, particularly important in remote sensing imagery where feature maps tend to be large.
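The windowed self-attention steps above can be condensed into the following sketch. It is an illustrative reconstruction rather than the authors' GCFEM code: the single 1 × 1 convolution producing Q, K, and V realizes the 3× channel expansion, the per-head dimension is taken as C/heads (32 for 256 channels, whereas the text quotes d = 64), and the class name and layer ordering are assumptions.

```python
import torch
import torch.nn as nn

class GCFEMSketch(nn.Module):
    def __init__(self, channels=256, window_size=8, heads=8):
        super().__init__()
        self.ws, self.heads = window_size, heads
        self.qkv = nn.Conv2d(channels, channels * 3, 1)   # 3x channel expansion -> Q, K, V
        self.proj = nn.Conv2d(channels, channels, 1)      # final 1x1 dimensionality reduction

    def forward(self, x):
        B, C, H, W = x.shape
        ws, h = self.ws, self.heads
        d = C // h
        assert H % ws == 0 and W % ws == 0, "feature map must be divisible by the window size"

        qkv = self.qkv(x)                                        # (B, 3C, H, W)
        qkv = qkv.view(B, 3, h, d, H // ws, ws, W // ws, ws)     # split channels and windows
        qkv = qkv.permute(1, 0, 4, 6, 2, 5, 7, 3)                # (3, B, nH, nW, heads, ws, ws, d)
        q, k, v = qkv.reshape(3, -1, h, ws * ws, d)              # each: (B*windows, heads, ws*ws, d)

        attn = (q @ k.transpose(-2, -1)) * d ** -0.5             # dot-product similarity per window
        attn = attn.softmax(dim=-1)                              # normalized attention weights
        out = attn @ v                                           # weighted sum of Value vectors

        # Reverse window partitioning back to the original (B, C, H, W) layout.
        out = out.view(B, H // ws, W // ws, h, ws, ws, d)
        out = out.permute(0, 3, 6, 1, 4, 2, 5).reshape(B, C, H, W)
        return self.proj(out)

# Example: GCFEMSketch()(torch.randn(1, 256, 32, 32)).shape -> (1, 256, 32, 32)
```

In the full GLMFFM, the output of this global branch is added element-wise to the MDCM output, as described above.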
Due to the large-scale variations and complex boundary shapes of ground objects in remote sensing images, traditional standard convolution operations often struggle to effectively identify and capture these complex shapes. To address this, the MDCM employs customized multi-scale convolution kernels to adaptively extract features and shapes of ground objects, enhancing the model’s ability to recognize large-scale changes and accurately delineate object boundaries. Figure 4 illustrates the application of deformable convolution in remote sensing images.
Deformable convolution consists of two parts: an ordinary convolution and a deformable branch. The deformable branch implements the primary deformation step, which learns an offset for each sampling coordinate of the convolution kernel. Formally, consider a 3 × 3 kernel with a regular sampling grid $R$. For each position $p_0$ on the output feature map $y$, the operation of standard convolution can be described as follows:
$$y(p_0) = \sum_{p_n \in R} W(p_n) \cdot x(p_0 + p_n)$$
where $p_n$ enumerates the positions in the sampling grid $R$, $x(p_0 + p_n)$ is the corresponding location on the input feature map, and $W(p_n)$ represents the weight of the convolution kernel.
Deformable convolution introduces an offset for each sampling point of the standard convolution. For each position $p_0$ on the output feature map $y$, the operation of deformable convolution can be described as follows:
$$y(p_0) = \sum_{p_n \in R} W(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$
where $\Delta p_n$ denotes the offset. However, since the offset may be non-integer and therefore not correspond to an actual pixel on the feature map, Equation (2) is realized by bilinear interpolation as follows:
$$x(p) = \sum_{q} G(q, p) \cdot x(q)$$
where $p$ represents an arbitrary location ($p = p_0 + p_n + \Delta p_n$) in Equation (2), $q$ enumerates all integral spatial positions in the feature map $x$, and $G(q, p)$ is the bilinear interpolation kernel. After applying the bilinear interpolation method, the standard backpropagation training is finally realized.
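As a concrete illustration of Equations (1)–(3), the sketch below builds one deformable branch on top of torchvision.ops.DeformConv2d, which performs the bilinear sampling of Equation (3) internally, while an ordinary convolution predicts the offsets $\Delta p_n$. The MDCM in the paper combines several such branches with different kernel sizes, which is not reproduced here; the class name and the zero initialization of the offset branch are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBranch(nn.Module):
    def __init__(self, channels=256, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Ordinary convolution that learns the offsets: 2 values (x, y) per kernel position.
        self.offset_conv = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        nn.init.zeros_(self.offset_conv.weight)   # start from the regular sampling grid R
        nn.init.zeros_(self.offset_conv.bias)
        # Deformable convolution samples x(p0 + p_n + Δp_n) via bilinear interpolation.
        self.deform_conv = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):
        offsets = self.offset_conv(x)              # (B, 2*k*k, H, W)
        return self.deform_conv(x, offsets)

# Usage on a decoder feature map, e.g. 256 channels at 32 x 32:
# y = DeformableBranch(256, kernel_size=3)(torch.randn(2, 256, 32, 32))
```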

3.3. Shallow–Deep Feature Fusion Module

In remote sensing (RS) image segmentation, multi-scale feature integration is essential due to the vast diversity in object sizes and complex spatial distributions. Shallow features capture fine-grained spatial details, like edges and textures, which are crucial for accurate object localization. Deep features, on the other hand, represent high-level semantic information that helps in understanding the overall scene context. However, traditional fusion strategies, such as simple addition or channel concatenation, often struggle to effectively combine these heterogeneous features, resulting in a semantic gap and suboptimal utilization of their complementary information.
Although attention-based models, such as A2-FPN, MAResU-Net, and MANet, have made significant progress in multi-scale feature aggregation by emphasizing important features, they often tend to over-rely on either shallow or deep features, lacking a balanced fusion strategy. To address this limitation, we propose the SDFFM, specifically designed to bridge the semantic gap between feature hierarchies and refine feature representations. As illustrated in Figure 5, SDFFM achieves a synergistic fusion of spatial details from shallow layers and semantic abstractions from deep layers. Specifically, SDFFM uses the abstract semantic information from deep features to accurately guide the selection of shallow spatial features, while simultaneously leveraging the precise spatial details from shallow features to enhance the effective utilization of deep semantic features. This optimized feature aggregation significantly improves the model’s ability to interpret complex scenes and leads to enhanced segmentation performance.
Specifically, global average pooling is first applied to the deep feature map to generate channel attention weights (CAW), which can be expressed as follows:
$$F_{CAW} = F_{Sigmoid}\big(F_{Conv1\times1}\big(F_{ReLU}\big(F_{Conv1\times1}\big(F_{Avgpool}(x)\big)\big)\big)\big)$$
where $F_{Avgpool}$ represents the global average pooling function, $F_{Conv1\times1}$ the 1 × 1 convolution layer, $F_{ReLU}$ the ReLU activation function, $F_{Sigmoid}$ the sigmoid function, and $x$ the input feature map.
Spatial attention weights (SPW) are generated by applying global average pooling along the channel dimension of shallow feature maps, followed by a 7 × 7 convolutional layer and sigmoid activation. The process can be mathematically represented as follows:
$$F_{SPW} = F_{Sigmoid}\big(F_{BN}\big(F_{Conv7\times7}\big(F_{Avgpool_h}(x) \oplus F_{Avgpool_w}(x)\big)\big)\big)$$
where $F_{Avgpool_h}$ and $F_{Avgpool_w}$ represent global average pooling applied along the h and w directions, $F_{Conv7\times7}$ represents the 7 × 7 convolution layer, $F_{BN}$ represents the BatchNorm function, and $\oplus$ represents the concatenation operator.
Subsequently, the channel attention weights (CAWs) are multiplied element-wise with the shallow feature map. This strategy enables the abstract semantic information from the deep feature map to guide the selection of relevant spatial features in the shallow feature map, thereby enhancing its semantic representation capability. Simultaneously, the spatial attention weights (SPWs) are multiplied element-wise with the deep feature map. This operation facilitates precise extraction of spatial information from shallow features while directing the selection of abstract features in the deep feature map, thus improving their spatial representation. Finally, the weighted deep and shallow feature maps are summed to produce fused features that eliminate redundant information while preserving the unique strengths associated with each scale. The SDFFM can be mathematically formulated as follows:
$$F_{SDFFM} = F_{CAW}(x_1) \odot x_2 + F_{SPW}(x_2) \odot x_1$$
where $\odot$ represents element-wise multiplication, $x_1$ and $x_2$ denote the deep and shallow feature maps, $F_{CAW}$ represents the channel attention weights in Equation (4), and $F_{SPW}$ represents the spatial attention weights in Equation (5).
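The following sketch illustrates Equations (4)–(6). The channel-attention branch follows Equation (4), with an assumed channel-reduction ratio of 4 between the two 1 × 1 convolutions; for the spatial-attention branch we follow the prose description (average pooling along the channel dimension, then a 7 × 7 convolution, BatchNorm, and sigmoid), since the directional pooling in Equation (5) can be realized in several ways. Class and argument names are hypothetical, and the deep feature map is assumed to have been upsampled to the shallow map's resolution beforehand.

```python
import torch
import torch.nn as nn

class SDFFMSketch(nn.Module):
    def __init__(self, channels=256, reduction=4):
        super().__init__()
        # Channel attention weights (CAW) from the deep feature map, Equation (4).
        self.caw = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention weights (SPW) from the shallow feature map, Equation (5).
        self.spw = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=7, padding=3),
            nn.BatchNorm2d(1),
            nn.Sigmoid(),
        )

    def forward(self, deep, shallow):
        caw = self.caw(deep)                                  # (B, C, 1, 1) channel weights
        spw = self.spw(shallow.mean(dim=1, keepdim=True))     # (B, 1, H, W) spatial weights
        # Deep semantics guide shallow features; shallow details guide deep features (Equation (6)).
        return caw * shallow + spw * deep

# Usage: fused = SDFFMSketch(256)(deep=torch.randn(2, 256, 64, 64),
#                                 shallow=torch.randn(2, 256, 64, 64))
```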

4. Results and Discussion

4.1. Dataset and Evaluation Metrics

The Potsdam dataset consists of aerial images of Potsdam, Germany, covering both urban and rural areas. It contains 38 orthophoto images with a GSD of 5 cm and a size of 6000 × 6000. Each image contains four bands: near infrared (IR), red (R), green (G), and blue (B), as well as the corresponding DSM and NDSM data. The dataset contains six categories: impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background. In the experiments, we only use RGB image data for training and testing. We use 24 of the 38 images for training, 2 for validation, and 12 for testing.
The Vaihingen dataset consists of aerial images of the city of Vaihingen, Germany, and its surrounding area, comprising 33 image tiles. Each tile has a ground sampling distance (GSD) of 9 cm/pixel, with tile sizes of approximately 2500 × 2000 pixels. The true orthophoto (TOP) tiles contain three bands: near infrared (IR), red (R), and green (G). In addition, the dataset provides Digital Surface Model (DSM) and Normalized Digital Surface Model (NDSM) data to represent surface elevation information. Its classification categories are the same as those of the Potsdam dataset. In this experiment, we only use the TOP image tiles. Of the 33 tiles, 15 are used for training, 2 for validation, and 16 for testing.
In order to comprehensively evaluate the performance of segmentation, we selected commonly used remote sensing segmentation evaluation indexes, including Overall Accuracy (OA), mean Intersection over Union (mIoU), and F1 score (F1), with the formula as follows:
$$OA = \frac{TP + TN}{TP + TN + FP + FN}$$
$$precision = \frac{TP}{TP + FP}$$
$$recall = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times precision \times recall}{precision + recall}$$
$$IoU = \frac{TP}{TP + FN + FP}$$
where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative pixels in the result, respectively. In addition, to evaluate the complexity of the model, we report the number of floating-point operations (FLOPs) and the number of network parameters.
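For reference, the sketch below computes OA, mean F1, and mIoU from a multi-class confusion matrix; in the multi-class case, OA reduces to the trace of the confusion matrix divided by the total pixel count. The function name and the small epsilon guard are our own.

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j] = number of pixels with ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp           # predicted as class i but belonging elsewhere
    fn = conf.sum(axis=1) - tp           # belonging to class i but predicted elsewhere

    oa = tp.sum() / conf.sum()                               # overall accuracy
    precision = tp / (tp + fp + 1e-10)
    recall = tp / (tp + fn + 1e-10)
    f1 = 2 * precision * recall / (precision + recall + 1e-10)
    iou = tp / (tp + fp + fn + 1e-10)
    return oa, f1.mean(), iou.mean()                         # OA, mean F1, mIoU

# Example with 6 classes:
# conf = np.random.randint(0, 100, size=(6, 6)); print(segmentation_metrics(conf))
```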

4.2. Experimental Details

The experiments in this paper are implemented based on the PyTorch 1.9.0 framework and conducted on an NVIDIA RTX A4000 GPU. To ensure a fair comparison, all models use a pre-trained ResNet-50 as the backbone network. During training, the SGD optimizer is employed with an initial learning rate of 0.01 and a weight decay of 0.001, and the learning rate is adjusted using a cosine annealing strategy. Data augmentation includes random flipping and rotation. The input image size for all datasets is set to 512 × 512. Large remote sensing images are divided into patches using a sliding window with a stride of 128 pixels. The patches are not resized, avoiding distortion caused by scaling and ensuring the model learns features at their true scale. For the Potsdam and Vaihingen datasets, the number of training epochs is set to 70 and the batch size to 4.
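A minimal sketch of this training setup is given below. The learning rate, weight decay, cosine annealing schedule, 512 × 512 patch size, 128-pixel stride, and epoch count follow the text; the SGD momentum value and the helper names are assumptions.

```python
import torch
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

def sliding_window_patches(tile, patch=512, stride=128):
    """Crop a large remote sensing tile (C, H, W) into overlapping 512x512 patches."""
    _, h, w = tile.shape
    return [tile[:, y:y + patch, x:x + patch]
            for y in range(0, h - patch + 1, stride)
            for x in range(0, w - patch + 1, stride)]

def build_optimizer(model: nn.Module, epochs: int = 70):
    optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-3)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)   # cosine annealing over 70 epochs
    return optimizer, scheduler

# Example: patches = sliding_window_patches(torch.randn(3, 6000, 6000))  # one Potsdam tile
```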

4.3. Ablation Experiments

4.3.1. Effectiveness of Each Module of GLDSFNet

In order to quantitatively evaluate the effectiveness of each module of GLDSFNet, we conducted corresponding ablation experiments on the Vaihingen dataset; the experimental results are shown in Table 1. We selected ResNet-50 as the baseline network.
Effectiveness of GLMFFM: To verify the effectiveness of the proposed global and multi-scale modules, we conducted ablation experiments by constructing two variants: Baseline + MDCM (with the global branch removed) and Baseline + GCFEM (with the multi-scale module removed). The results, shown in Table 1, indicate that introducing GCFEM on the Vaihingen dataset improved OA, mIoU, and average F1 by 1.3%, 5.42%, and 4.39%, respectively, while introducing MDCM led to increases of 1.57%, 6.56%, and 5.18%, respectively. When both modules are combined, segmentation performance further improves by 1.9%, 6.79%, and 5.33%, confirming the complementary nature of GCFEM and MDCM in feature extraction. Although the integration of the GLMFFM module increases the computational cost—adding 248 MFLOPs and 21.4 M parameters—the performance gains in terms of segmentation accuracy are substantial. Compared to prior work, our approach achieves significantly better accuracy with a moderate increase in complexity, offering a favorable trade-off between model performance and computational cost.
To further verify the advantages of GLMFFM in extracting and integrating Global–Local contextual information, we conducted comparative experiments with other advanced Global–Local contextual information modules. The experimental results are shown in Table 2. We chose GLTB and MLTB as models for comparison with GLMFFM, and conducted ablation experiments on the Vaihingen dataset. The results show that, compared to the second-best performing MLTB, our GLMFFM increased OA, mIoU, and average F1 by 0.49%, 1.25%, and 0.87%, respectively, on the Vaihingen dataset. Despite the higher complexity of GLMFFM, it effectively utilizes global information and multiscale local information of the image, enhancing the expressiveness of features and thus achieving better performance. These results further confirm the advantages of the GLMFFM in extracting and integrating Global–Local contextual information. GLMFFM can more comprehensively understand and analyze remote sensing images, enabling the network to achieve higher accuracy and generalization ability in remote sensing image analysis tasks. Although the introduction of GLMFFM increases the complexity of the network, its significant contribution to performance improvement makes it a deep learning method with broad application prospects.
Effectiveness of SDFFM: In order to verify the effectiveness of the proposed multi-scale feature aggregation module, we first added it to the baseline alone. As shown in Table 1, after embedding SDFFM, the OA, mIoU, and average F1 on the Vaihingen dataset increased by 1.8%, 5.93%, and 4.71%, respectively, verifying the effectiveness of SDFFM. At the same time, in order to verify whether SDFFM and GLMFFM can be effectively combined, SDFFM was embedded into Baseline + GLMFFM, forming GLDSFNet (Baseline + GLMFFM + SDFFM). As shown in Table 1, the OA, mIoU, and average F1 on the Vaihingen dataset increased by a further 0.28%, 0.3%, and 0.18%, respectively, after embedding SDFFM, while the computational cost of the model increased by only 4 FLOPs. In order to further demonstrate the advantages of the proposed SDFFM in multi-scale feature fusion, we conducted ablation experiments with other advanced multi-scale feature modules. As shown in Table 3, our module outperforms these advanced modules, which indicates that SDFFM fully exploits the complementary strengths of deep and shallow features during fusion.

4.3.2. Effectiveness of GLMFFM

To determine the optimal configuration of the GLMFFM modules, we conducted multiple experiments on the Vaihingen dataset. Initially, a single GLMFFM module was integrated into the shallow layers of the decoder, and the number of modules was then gradually increased to four. The experimental results (see Table 4) show that the network achieves the best performance when four modules are incorporated. This design is based on the four feature maps of different scales extracted during the decoding phase, with each GLMFFM module applied to a corresponding feature map, enabling effective fusion and enhancement of multi-scale information. The GLMFFM module can capture global contextual information of the image, such as semantic relationships and long-range dependencies, while also extracting local context, such as textures and edges. Through multi-scale modeling, detail preservation is balanced with large-scale perception. The collaborative use of multiple modules significantly improves the network’s representational capacity, segmentation performance, and generalization ability in complex remote sensing scenarios.

4.4. Comparative Experiments

To evaluate the performance of GLDSFNet, it was compared with a variety of typical and recent segmentation methods. These include two classic methods, FCN and U-Net; three methods based on multi-scale feature fusion, FPN, PSPNet, and DeepLabV3+; three attention-based multi-scale aggregation networks, MANet, MAResU-Net, and A2-FPN; and three transformer-based methods, DC-Swin, UNetFormer, and CMLFormer.

4.4.1. Experimental Results on the ISPRS Vaihingen

As shown in Table 5, the experimental results on the ISPRS Vaihingen dataset show that our proposed GLDSFNet achieves the best overall performance, with a MeanF1 of 84.47%, mIoU of 76.75%, and OA of 91.38%, which are 0.58%, 0.90%, and 0.64% higher than the second-best method, respectively. In terms of IoU for each category, GLDSFNet also performs well, reaching 81.60% for impervious surfaces, 88.33% for buildings, 66.68% for low vegetation, 75.46% for trees, and 66.16% for cars. These values exceed those of the second-best CMTFNet by 1.36%, 0.51%, 1.08%, 0.79%, and 0.47%, respectively. For F1 scores, GLDSFNet also achieves the highest results for cars and low vegetation, with scores of 79.63% and 80.01%, surpassing UNetFormer by 0.34% and 0.78%, and exceeding MANet and PSPNet by 1.72% and 2.28%, respectively. These significant improvements on small-scale targets demonstrate the effectiveness of our proposed Shallow–Deep Feature Fusion Module, which efficiently integrates high-resolution shallow features rich in local detail with deeper semantic features, enabling more accurate boundary localization and better preservation of fine structures, thus improving segmentation of multi-scale targets.
To visually compare the segmentation results of different algorithms, we present the segmentation results on the Vaihingen dataset, as shown in Figure 6, where each image has a size of 512 × 512. Classic CNN algorithms, such as U-Net and PSPNet, primarily focus on local features, making it challenging to ensure object completeness and accurate edge localization. Attention-based multi-scale feature aggregation algorithms, such as A2-FPN and MANet, enhance multi-scale feature aggregation by introducing linear attention and dot-product attention mechanisms. Although feature fusion methods incorporating attention enhance the discrimination of features, they fail to fully consider the inherent characteristics of shallow and deep features, resulting in poor segmentation of small objects. Transformer-based algorithms, such as UNetFormer and CMLFormer, utilize Transformers to extract global contextual information, enabling long-range dependency modeling. However, they often neglect multi-scale local features, resulting in inaccurate edge localization. GLDSFNet overcomes this limitation through the MDCM module. The deformable convolutions allow the network to adaptively focus on irregular shapes and object boundaries, thereby improving edge localization accuracy.
To visually demonstrate the advantages of our proposed method, we also compared the experimental results on the Vaihingen dataset. As shown in Figure 7, GLDSFNet significantly outperforms other algorithms in segmentation performance, particularly in the areas marked by red boxes. When dealing with buildings in remote sensing images, which often vary greatly in size and have irregular edges, our method not only effectively reduces misclassification within the same category but also precisely captures the boundary contours of objects. For small-scale targets, such as densely parked vehicles, GLDSFNet similarly demonstrates exceptional performance, clearly distinguishing the subtle edge gaps between vehicles, fully showcasing its efficiency and accuracy in handling multi-scale targets.

4.4.2. Experimental Results on the ISPRS Potsdam

To evaluate the generalization ability of GLDSFNet, we conducted extensive comparative experiments on the ISPRS Potsdam dataset. Table 6 presents the performance of various segmentation methods on this dataset. As shown, GLDSFNet outperforms all competing methods across key evaluation metrics, achieving the highest OA of 91.06%, MF1 of 90.79%, and mIoU of 83.40%. Compared to the second-best method, MANet, GLDSFNet improves these three metrics by 0.55%, 0.51%, and 0.84%, respectively, demonstrating its strong competitiveness in high-precision remote sensing image segmentation tasks.
In terms of segmentation performance for specific categories, GLDSFNet also demonstrates stable and significant advantages. Whether for large objects, such as impervious surfaces with an F1 score of 92.88% and IoU of 86.66%, and buildings with an F1 score of 96.57% and IoU of 93.45%, or for smaller and more complex categories, such as low vegetation with an F1 score of 86.21% and IoU of 75.67%, trees with an F1 score of 86.01% and IoU of 75.46%, and cars with an F1 score of 92.35% and IoU of 85.74%, GLDSFNet achieves the highest metrics. Especially in small object categories, GLDSFNet surpasses the second-best method UNetFormer by 0.74% in F1 score for both cars and low vegetation, demonstrating its superior capability in capturing fine details. Traditional CNN methods, such as FCNs and U-Net, are limited by their receptive fields and local feature representations, resulting in insufficient boundary localization. Multi-scale feature fusion methods, like FPN and PSPNet, show some improvement but still perform poorly in small object segmentation. Attention-based methods, such as MAResU-Net and MANet, enhance feature aggregation, but there is still room for improvement in integrating information across different layers.
In contrast, GLDSFNet effectively extracts and integrates both global and local multi-scale information, enhancing feature representation while accurately identifying boundary shapes of objects at different scales. By fully considering the inherent characteristics of shallow and deep features, this method performs especially well in segmenting small-scale objects and, after fusing multi-scale features, the overall segmentation performance is significantly improved.
To provide a more intuitive demonstration of the segmentation performance of different algorithms, we present several challenging image segmentation results on the Potsdam dataset. As shown in Figure 7, the selected test images include red boxes highlighting different objects. The red boxes in the first and second sets contain objects with complex shapes, such as trees and low vegetation. The red box in the third set highlights targets with significant scale differences, including buildings and vehicles. Compared to other algorithms, the segmentation results of GLDSFNet still perform excellently in these complex target segmentation tasks. Particularly in the regions marked by red boxes, GLDSFNet accurately captures the object boundaries and achieves a higher level of distinction between different object categories. This indicates that GLDSFNet, while extracting global contextual information, effectively integrates multi-scale local features, further enhancing its ability to handle objects of various scales.
Overall, the GLDSFNet method successfully handles the segmentation tasks of objects with varying scales and complexities in the ISPRS Potsdam dataset, demonstrating excellent performance in detail capture and boundary localization. Particularly in more challenging localized regions, it achieves better segmentation results. Compared to other existing methods, GLDSFNet shows significant advantages in multi-scale object segmentation and boundary localization.

4.5. Discussion

We present the heatmaps generated by the baseline ResNet-50, Baseline + SDFFM, Baseline + GLMFFM, and Baseline + GLMFFM + SDFFM (Figure 8). These heatmaps illustrate how the models identify pixels belonging to buildings, trees, and cars. After adding SDFFM to the baseline, the model shows more activated (high-value) areas, which is especially noticeable for small-scale objects. This indicates that SDFFM effectively integrates shallow and deep features. Secondly, after incorporating GLMFFM, the model not only provides more useful global semantics at the overall map scale but also produces activation contours that are closer to the actual boundaries of the objects. This demonstrates that the fusion of Shallow–Deep features and the integration of Global–Local multi-scale information can extract more local details and global information. The visualization results further confirm that the designed GLMFFM and SDFFM can more effectively extract and integrate Shallow–Deep features and Global–Local multi-scale information, thereby achieving better semantic segmentation performance.

5. Conclusions

This paper proposes a novel semantic segmentation network for remote sensing images that effectively addresses challenges posed by complex object shapes and scale variations. The GCFEM is designed to capture comprehensive global contextual information and model long-range dependencies, which are crucial for accurately understanding large-scale spatial relationships in remote sensing scenes. Meanwhile, the MDCM incorporates multiple deformable convolution kernels that adaptively handle irregular and diverse object boundaries, enabling precise localization of complex shapes and targets of varying sizes. To further enhance segmentation accuracy, the SDFFM facilitates effective fusion of high-resolution shallow features with semantically rich deep features. Extensive experiments on the high-resolution ISPRS Vaihingen and ISPRS Potsdam datasets demonstrate that the proposed method significantly improves segmentation performance.
However, the model’s relatively high computational cost may limit its real-time performance on ultra-high-resolution images. Additionally, segmentation accuracy in scenes with densely packed small objects, such as vehicles in parking lots, still leaves room for improvement. Future work will focus on optimizing the network to reduce model complexity and improve inference speed, as well as enhancing the capability to accurately segment dense small targets. Moreover, we plan to explore the applicability of this network in related tasks, such as remote sensing image classification and object detection.

Author Contributions

The following contributions were made to this research effort: Conceptualization, N.C. and L.W.; methodology, R.Y. and Q.D.; software, N.C., Y.Z. and L.W.; validation, N.C.; writing—original draft preparation, N.C. and R.Y.; writing—review and editing, N.C. and L.W.; funding acquisition, L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China under Grant No. 32160369, the Fundamental Research Project of Yunnan Province under Grant No. 202501AS070090 and the Ten Thousand Talent Plans for Young Top-notch Talents of Yunnan Province under Grant No. YNWR-QNBJ-2019-026.

Data Availability Statement

The Potsdam and Vaihingen datasets in this study are openly and freely available at https://www.isprs.org/education/benchmarks/UrbanSemLab/default.aspx (accessed on 5 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, X.; Yong, X.; Li, T.; Tong, Y.; Gao, H.; Wang, X.; Xu, Z.; Fang, Y.; You, Q.; Lyu, X. A Spectral–Spatial Context-Boosted Network for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2024, 16, 1214. [Google Scholar] [CrossRef]
  2. Xu, Y.; Bai, T.; Yu, W.; Chang, S.; Atkinson, P.M.; Ghamisi, P. AI Security for Geoscience and Remote Sensing: Challenges and Future Trends. IEEE Geosci. Remote Sens. Mag. 2023, 11, 60–85. [Google Scholar] [CrossRef]
  3. Ajibola, S.; Cabral, P. A Systematic Literature Review and Bibliometric Analysis of Semantic Segmentation Models in Land Cover Mapping. Remote Sens. 2024, 16, 2222. [Google Scholar] [CrossRef]
  4. Xiang, S.; Xie, Q.; Wang, M. Semantic Segmentation for Remote Sensing Images Based on Adaptive Feature Selection Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8006705. [Google Scholar] [CrossRef]
  5. Wang, L.; Dong, S.; Chen, Y.; Meng, X.; Fang, S.; Fei, S. MetaSegNet: Metadata-Collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5644211. [Google Scholar] [CrossRef]
  6. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  7. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  8. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  9. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv 2016, arXiv:1606.02147. [Google Scholar]
  10. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 June 2017; pp. 2881–2890. [Google Scholar]
  11. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 833–851. [Google Scholar]
  12. Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. DenseASPP for Semantic Segmentation in Street Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3684–3692. [Google Scholar]
  13. He, J.; Deng, Z.; Qiao, Y. Dynamic Multi-Scale Filters for Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3561–3571. [Google Scholar]
  14. Wang, G.; Zhai, Q.; Lin, J. Multi-Scale Network for Remote Sensing Segmentation. IET Image Process. 2022, 16, 1742–1751. [Google Scholar] [CrossRef]
  15. Wang, L.; Zhang, C.; Li, R.; Duan, C.; Meng, X.; Atkinson, P.M. Scale-Aware Neural Network for Semantic Segmentation of Multi-Resolution Remote Sensing Images. Remote Sens. 2021, 13, 5015. [Google Scholar] [CrossRef]
  16. Li, S.; Yan, F.; Liu, Y.; Shen, Y.; Liu, L.; Wang, K. A Multi-Scale Rotated Ship Targets Detection Network for Remote Sensing Images in Complex Scenarios. Sci. Rep. 2025, 15, 2510. [Google Scholar] [CrossRef]
  17. Zou, Q.; Ni, L.; Zhang, T.; Wang, Q. Deep Learning Based Feature Selection for Remote Sensing Scene Classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2321–2325. [Google Scholar] [CrossRef]
  18. Ren, Y.; Zhu, C.; Xiao, S. Small Object Detection in Optical Remote Sensing Images via Modified Faster R-CNN. Appl. Sci. 2018, 8, 813. [Google Scholar] [CrossRef]
  19. Meng, W.; Shan, L.; Ma, S.; Liu, D.; Hu, B. DLNet: A Dual-Level Network with Self- and Cross-Attention for High-Resolution Remote Sensing Segmentation. Remote Sens. 2025, 17, 1119. [Google Scholar] [CrossRef]
  20. Zhang, C.; Jiang, W.; Zhang, Y.; Wang, W.; Zhao, Q.; Wang, C. Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4408820. [Google Scholar] [CrossRef]
  21. Dosovitskiy, A.; Springenberg, J.T.; Riedmiller, M.; Brox, T. Discriminative Unsupervised Feature Learning with Convolutional Neural Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; MIT Press: Cambridge, MA, USA, 2014; Volume 1, pp. 766–774. [Google Scholar]
  22. Pereira, G.A.; Hussain, M. A Review of Transformer-Based Models for Computer Vision Tasks: Capturing Global Context and Spatial Relationships. arXiv 2024, arXiv:2408.15178. [Google Scholar]
  23. Zheng, S.; Lu, J.; Zhao, H.; Xu, X.; Yang, Z.; Zhang, S.; Li, S.; Luo, G.; Xu, Y. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 6881–6890. [Google Scholar]
  24. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  25. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  26. Hwang, G.; Jeong, J.; Lee, S.J. SFA-Net: Semantic Feature Adjustment Network for Remote Sensing Image Segmentation. Remote Sens. 2024, 16, 3278. [Google Scholar] [CrossRef]
  27. Temenos, A.; Temenos, N.; Kaselimi, M.; Doulamis, A.; Doulamis, N. Interpretable Deep Learning Framework for Land Use and Land Cover Classification in Remote Sensing Using SHAP. IEEE Geosci. Remote Sens. Lett. 2023, 20, 8500105. [Google Scholar] [CrossRef]
  28. Bielecka, E.; Markowska, A.; Wiatkowska, B.; Calka, B. Sustainable Urban Land Management Based on Earth Observation Data—State of the Art and Trends. Remote Sens. 2025, 17, 1537. [Google Scholar] [CrossRef]
  29. Zhu, W.; He, W.; Li, Q. Hybrid AI and Big Data Solutions for Dynamic Urban Planning and Smart City Optimization. IEEE Access 2024, 12, 189994–190006. [Google Scholar] [CrossRef]
  30. Haack, B.; Bryant, N.; Adams, S. An Assessment of Landsat MSS and TM Data for Urban and Near-Urban Land-Cover Digital Classification. Remote Sens. Environ. 1987, 21, 201–213. [Google Scholar] [CrossRef]
  31. Li, D.; Zhang, J.; Liu, G. Autonomous Driving Decision Algorithm for Complex Multi-Vehicle Interactions: An Efficient Approach Based on Global Sorting and Local Gaming. IEEE Trans. Intell. Transp. Syst. 2024, 25, 6927–6937. [Google Scholar] [CrossRef]
  32. Li, R.; Wang, L.; Zhang, C.; Duan, C.; Zheng, S. A2-FPN for Semantic Segmentation of Fine-Resolution Remotely Sensed Images. Int. J. Remote Sens. 2022, 43, 1131–1155. [Google Scholar] [CrossRef]
  33. Li, R.; Zheng, S.; Duan, C.; Su, J.; Zhang, C. Multistage Attention ResU-Net for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8009205. [Google Scholar] [CrossRef]
  34. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P.M. Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607713. [Google Scholar] [CrossRef]
  35. Zhu, S.; Zhang, B.; Wen, D.; Tian, Y. NCSBFF-Net: Nested Cross-Scale and Bidirectional Feature Fusion Network for Lightweight and Accurate Remote-Sensing Image Semantic Segmentation. Electronics 2025, 14, 1335. [Google Scholar] [CrossRef]
  36. Cheng, Y.; Wang, W.; Zhang, W.; Yang, L.; Wang, J.; Ni, H.; Guan, T.; He, J.; Gu, Y.; Tran, N.N. A Multi-Feature Fusion and Attention Network for Multi-Scale Object Detection in Remote Sensing Images. Remote Sens. 2023, 15, 2096. [Google Scholar] [CrossRef]
  37. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  38. Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A Novel Transformer Based Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6506105. [Google Scholar] [CrossRef]
  39. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  40. Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and Multiscale Transformer Fusion Network for Remote-Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2004612. [Google Scholar] [CrossRef]
  41. Wang, X.; Jiang, B.; Wang, X.; Luo, B. MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection. arXiv 2021, arXiv:2112.01177. [Google Scholar]
  42. Wu, H.; Zhang, M.; Huang, P.; Tang, W. CMLFormer: CNN and Multiscale Local-Context Transformer Network for Remote Sensing Images Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 7233–7241. [Google Scholar] [CrossRef]
  43. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Figure 1. The main challenges in semantic segmentation of remote sensing images: (a) significant variations in morphology and size among objects such as buildings, vegetation, and vehicles; (b) frequent loss of boundary information when segmenting small-scale targets such as vehicles and tree patches, owing to their limited spatial extent.
Figure 2. Structure of the proposed GLDSFNet.
Figure 3. Global–Local Multi-Scale Feature Fusion Module.
Figure 4. Deformable convolution on a remote sensing image.
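Figure 4 illustrates how deformable convolution [25] adapts its sampling grid to the shapes of ground objects instead of using a fixed square window. As a minimal, illustrative sketch only (not the authors' module), the snippet below builds such an operation from torchvision's DeformConv2d, with an ordinary convolution predicting the per-position sampling offsets; the block name, channel sizes, and layer layout are assumptions for demonstration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """A plain conv predicts (dx, dy) offsets for each of the k*k kernel positions;
    DeformConv2d then samples the feature map at those shifted locations."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.dconv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dconv(x, self.offset(x))   # offsets: (N, 2*k*k, H, W)

# Example: a 64-channel encoder feature map from a remote sensing patch
feats = torch.randn(1, 64, 128, 128)
print(DeformBlock(64, 64)(feats).shape)        # torch.Size([1, 64, 128, 128])
```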
Figure 5. Shallow–Deep Feature Fusion Module.
Figure 6. Visualization of experimental results on the ISPRS Vaihingen dataset.
Figure 7. Visualization of experimental results on the ISPRS Potsdam dataset.
Figure 8. Six heat-map samples. Subfigures (a–c) show how the network determines whether a pixel belongs to the Building, Car, and Tree classes, respectively.
Table 1. Quantitative results of each module of the proposed method on the Vaihingen dataset.

Method | Params (M) | FLOPs (GFLOPs) | OA (%) | Mean F1 (%) | MIoU (%)
Resnet50 (baseline) | 23.52 | 21 | 85.46 | 80.35 | 68.55
Baseline + SDFFM | 23.73 | 25 | 87.26 | 85.06 | 74.48
Baseline + GCFEM | 25.56 | 30 | 86.76 | 84.74 | 73.97
Baseline + MDCM | 44.35 | 129 | 87.03 | 85.53 | 75.11
Baseline + GLMFFM | 44.92 | 131 | 87.36 | 85.68 | 75.34
Baseline + GLMFFM + SDFFM | 45.41 | 135 | 87.64 | 85.86 | 75.64
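For orientation, parameter counts such as those in Table 1 can be sanity-checked directly in PyTorch. The sketch below counts the parameters of a ResNet-50 backbone with its 1000-way classification head removed, which comes to roughly 23.5 M, in line with the baseline row; the exact backbone truncation is an assumption on our part, and the GFLOPs values additionally depend on the input resolution and profiling tool, so they are not reproduced here.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# ResNet-50 without its final fully connected classifier, i.e. the part that is
# typically reused as a segmentation encoder (an assumed configuration, for illustration).
backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-1])

n_params = sum(p.numel() for p in backbone.parameters())
print(f"Backbone parameters: {n_params / 1e6:.2f} M")   # ~23.51 M
```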
Table 2. Ablation experiments on the Vaihingen dataset with different Global–Local modules.

Module | Params (M) | FLOPs (GFLOPs) | OA (%) | Mean F1 (%) | mIoU (%)
Resnet50 + GLTB | 28.24 | 44 | 84.74 | 83.75 | 73.97
Resnet50 + MLTB | 26.17 | 32 | 86.87 | 84.81 | 74.09
Resnet50 + GLMFFM | 44.92 | 269 | 87.36 | 85.68 | 75.34
Table 3. Ablation experiments on the Vaihingen dataset with different feature fusion modules.

Module | Params (M) | FLOPs (GFLOPs) | OA (%) | Mean F1 (%) | mIoU (%)
Resnet50 + FPN | 23.60 | 24 | 86.74 | 84.83 | 74.10
Resnet50 + AAM | 47.67 | 90 | 87.01 | 84.90 | 74.23
Resnet50 + SDFFM | 23.73 | 25 | 87.26 | 85.06 | 74.48
Table 4. Quantitative results of applying GLMFFM at different stages of the network on the Vaihingen dataset.

GLMFFM1 | GLMFFM2 | GLMFFM3 | GLMFFM4 | Params (M) | FLOPs (GFLOPs) | OA (%) | Mean F1 (%) | mIoU (%)
 |  |  |  | 30.09 | 107 | 87.31 | 85.62 | 74.84
 |  |  |  | 35.02 | 128 | 87.35 | 85.44 | 75.01
 |  |  |  | 40.31 | 133 | 87.45 | 85.72 | 75.41
 |  |  |  | 45.41 | 135 | 87.64 | 85.86 | 75.64
Table 5. Experimental results on the ISPRS Vaihingen dataset. Per-class values are given as F1 score (%) / IoU score (%).

Method | Imp_Sur | Build | Low_veg | Tree | Car | MF1 (%) | OA (%) | MIoU (%)
FCNs | 87.08/77.11 | 92.43/85.92 | 77.68/63.50 | 84.65/73.38 | 59.96/42.82 | 80.35 | 85.46 | 68.55
U-Net | 87.82/78.28 | 91.59/84.49 | 77.78/63.64 | 84.55/73.23 | 72.50/56.86 | 82.84 | 85.60 | 71.29
FPN | 88.83/79.91 | 93.16/87.20 | 78.74/64.93 | 85.33/74.42 | 78.06/64.02 | 84.83 | 86.73 | 74.10
PSPNet | 88.20/78.89 | 92.83/86.62 | 78.88/65.13 | 85.41/74.53 | 76.46/61.89 | 84.36 | 86.53 | 73.41
DeepLabV3+ | 88.85/79.94 | 93.08/87.06 | 78.58/64.72 | 85.29/74.35 | 76.85/62.40 | 84.49 | 86.61 | 73.64
MAResU-Net | 85.12/74.09 | 92.99/86.90 | 78.40/64.48 | 85.47/74.63 | 78.60/64.74 | 84.84 | 86.62 | 74.10
MANet | 88.83/79.90 | 92.99/86.89 | 78.68/64.85 | 85.62/74.85 | 78.91/65.17 | 85.00 | 86.78 | 74.33
A2-FPN | 88.89/80.00 | 93.29/87.43 | 78.60/64.75 | 85.45/74.60 | 77.43/63.17 | 84.73 | 86.77 | 74.00
UNetFormer | 89.04/80.24 | 93.34/87.51 | 79.23/65.60 | 85.50/74.67 | 79.29/65.69 | 85.28 | 87.00 | 74.74
CMLFormer | 88.99/80.17 | 93.25/87.36 | 78.57/64.71 | 85.42/74.55 | 77.35/63.07 | 84.72 | 86.77 | 73.97
GLDSFNet | 89.87/81.60 | 93.80/88.33 | 80.01/66.68 | 86.01/75.46 | 79.63/66.16 | 85.86 | 87.64 | 75.64
Table 6. Experimental results on the ISPRS Potsdam dataset. Per-class values are given as F1 score (%) / IoU score (%).

Method | Imp_Sur | Build | Low_veg | Tree | Car | MF1 (%) | OA (%) | MIoU (%)
FCNs | 88.73/79.65 | 93.09/86.98 | 81.57/68.85 | 81.26/68.40 | 80.95/67.96 | 85.08 | 86.65 | 74.37
U-Net | 90.91/83.35 | 94.63/89.72 | 90.49/82.67 | 83.35/71.42 | 91.59/84.50 | 88.31 | 88.98 | 80.33
FPN | 90.73/83.07 | 94.92/90.39 | 83.76/72.10 | 83.72/71.91 | 88.59/79.53 | 88.25 | 88.90 | 79.40
PSPNet | 91.96/85.18 | 95.94/92.35 | 84.74/73.53 | 85.41/74.57 | 89.85/81.59 | 89.61 | 90.18 | 81.45
DeepLabV3+ | 92.15/85.45 | 90.45/82.62 | 85.32/74.45 | 85.25/74.38 | 90.41/82.50 | 89.88 | 90.39 | 81.88
MAResU-Net | 92.45/86.09 | 96.37/93.07 | 85.56/74.70 | 85.06/74.05 | 91.55/84.37 | 90.21 | 90.55 | 82.46
MANet | 92.25/85.73 | 96.26/92.95 | 85.58/74.71 | 85.21/74.20 | 91.99/85.19 | 90.28 | 90.51 | 82.56
A2-FPN | 92.12/85.44 | 96.15/92.43 | 85.24/74.27 | 84.69/73.50 | 91.16/83.87 | 89.89 | 90.24 | 81.90
UNetFormer | 92.30/85.79 | 96.25/92.94 | 85.47/74.57 | 84.97/73.89 | 91.61/84.46 | 90.13 | 90.47 | 82.33
CMLFormer | 91.91/85.12 | 96.15/92.45 | 85.21/74.24 | 84.71/73.53 | 91.15/83.86 | 89.45 | 89.85 | 82.12
GLDSFNet | 92.88/86.66 | 96.57/93.45 | 86.21/75.67 | 86.01/75.46 | 92.35/85.74 | 90.79 | 91.06 | 83.40
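The indicator columns in Tables 5 and 6 follow the standard definitions of per-class F1 and IoU, overall accuracy (OA), mean F1 (MF1), and mean IoU (MIoU) computed from the test-set confusion matrix. The snippet below is a minimal NumPy sketch of these definitions only; the confusion matrix shown is a toy 3-class example, not the authors' evaluation data or code.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """Per-class F1/IoU plus OA, MF1 and MIoU from a confusion matrix
    whose rows are ground-truth classes and columns are predictions."""
    conf = conf.astype(np.float64)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    f1 = 2 * tp / (2 * tp + fp + fn)     # equivalent to 2PR / (P + R)
    iou = tp / (tp + fp + fn)
    oa = tp.sum() / conf.sum()
    return f1, iou, oa, f1.mean(), iou.mean()

# Toy 3-class confusion matrix (illustrative values only)
conf = np.array([[90,  5,  5],
                 [ 4, 92,  4],
                 [10,  6, 84]])
f1, iou, oa, mf1, miou = segmentation_metrics(conf)
print(f"OA={oa:.4f}  MF1={mf1:.4f}  MIoU={miou:.4f}")
```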
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
