Article

NCSBFF-Net: Nested Cross-Scale and Bidirectional Feature Fusion Network for Lightweight and Accurate Remote-Sensing Image Semantic Segmentation

Shihao Zhu, Binqiang Zhang, Dawei Wen and Yuan Tian
1 School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China
2 Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan 430205, China
3 School of Geographical Sciences, Lingnan Normal University, Zhanjiang 524048, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(7), 1335; https://doi.org/10.3390/electronics14071335
Submission received: 28 February 2025 / Revised: 21 March 2025 / Accepted: 24 March 2025 / Published: 27 March 2025

Abstract

Semantic segmentation has emerged as a critical research area in Earth observation. This paper proposes a novel end-to-end semantic segmentation network, the Nested Cross-Scale and Bidirectional Feature Fusion Network (NCSBFF-Net), to address issues such as intra-class heterogeneity, inter-class homogeneity, scale variability, and the classification of tiny objects. Specifically, a CNN-based lightweight feature pyramid module is employed to extract contextual information across multiple scales, thereby addressing intra-class heterogeneity. The NCSBFF module leverages features from both shallow and deep layers and is designed to fuse multi-scale features, thereby enhancing inter-class semantic differences. Additionally, the shallowest feature is passed to the Shuffle Attention block in the NCSBFF module, which adaptively filters out weak details and highlights critical information for the classification of tiny objects. Extensive experiments were conducted on the Potsdam and Vaihingen benchmarks. Experimental results demonstrate that NCSBFF-Net outperforms state-of-the-art methods and achieves a better trade-off between accuracy and efficiency: relative to the most lightweight comparative method, it delivers a more than 5% improvement in mIoU, significantly enhancing the recognition of small and complex objects such as vehicles and irregular land parcels in challenging scenes, at only about 1.73 times the computational cost, providing an optimized solution for deployment on edge devices.

1. Introduction

Rapid advancements in aerospace and sensor technologies for Earth observation have significantly increased access to high-quality remote sensing images. The effective interpretation of these data can provide valuable insights into managing the ecological environment and monitoring human activities [1,2]. Specifically, the semantic segmentation of remote sensing images, which performs pixel-wise classification by assigning specific land cover/land use types to each pixel, has garnered significant attention in image interpretation tasks. Semantic segmentation of remote sensing images is applied in various real-world scenarios, including urban planning [3,4,5], disaster assessment [6,7], and agricultural management [8,9].
Numerous methods for the semantic segmentation of remote sensing images have been developed. Before the advent of deep learning, pixel-wise classification was performed using classifiers such as Support Vector Machines (SVM), Random Forests (RF), decision trees, and K-nearest neighbors. Among these, SVM and RF have received significant attention in remote sensing image classification due to their superior classification performance, efficiency in handling high-dimensional data, and independence from assumptions about the data distribution, compared to other conventional classifiers [10].
Traditional semantic segmentation approaches require the manual setting of parameters (e.g., the choice of an appropriate kernel in SVM and the number of trees in RF), rely on handcrafted operators for feature extraction, and suffer from low segmentation accuracy [11]. In recent years, to achieve high-precision segmentation results, many researchers have turned to semantic segmentation algorithms based on Convolutional Neural Networks (CNNs) [12,13,14]. Compared to traditional methods, CNNs have substantially improved the segmentation accuracy of remote sensing images. The Fully Convolutional Network (FCN) [15] was the first prominent CNN-based semantic segmentation network, replacing fully connected layers with convolutional layers to generate contextual features. However, the FCN does not consider pixel relationships during up-sampling. In contrast, the U-Net network, proposed by Ronneberger et al. (2015) [16], uses an encoder to generate deep semantic information, a decoder to recover spatial details, and skip connections to integrate detailed features from multiple scales in the encoder into the decoder [17], thereby integrating more feature information than the FCN. The encoder–decoder structure is a widely used approach for multi-scale context modeling, and other representative models, such as RefineNet [18] and HRNet [19], also employ it. Diakogiannis et al. [20] used the U-Net encoder and decoder as the backbone, combining them with residual connections, atrous convolutions, and pyramid scene parsing pooling. Building on the attention mechanism, Li et al. refactored the skip connections in the original U-Net and designed a multistage attention ResU-Net for the semantic segmentation of high-resolution remote sensing images [21]. In addition to these U-Net variants with two symmetric paths, a multi-path encoder structure for feature extraction has been proposed, along with two feature fusion blocks for fusing multi-path and multi-level features, which improve classification accuracy for boundaries between target objects [22]. Zhao et al. [23] proposed PSPNet, which employs a pyramid pooling structure that partitions the feature map into multiple levels and sub-regions.
Overall, deep learning-based methods have achieved substantial success in the semantic segmentation of remote sensing images. The primary challenges include high intra-class variance (i.e., objects within the same category, such as buildings, may exhibit various shapes, textures, colors, scales, and structures), low inter-class variance (i.e., objects from different classes may share similar visual properties, such as buildings and impervious surfaces) [24], large scale variability (e.g., buildings and cars have very different scales, and even objects within the same category, such as buildings, exhibit different sizes) [25], and, in particular, the presence of very tiny and small objects (e.g., cars in urban scenes) [26]. Given these challenges, it is crucial to examine design choices for efficient feature extraction and fusion to improve segmentation accuracy and computational efficiency. In this paper, a novel lightweight and effective semantic segmentation model is presented, namely, the Nested Cross-Scale and Bidirectional Feature Fusion Network (NCSBFF-Net). Our main contributions are summarized as follows:
  • A new multi-scale feature fusion module, i.e., the NCSBFF module, is proposed for feature fusion to improve segmentation accuracy for scale-varied objects.
  • The shallowest feature is passed to the Shuffle Attention block in the NCSBFF module, which adaptively filters out weak details and highlights critical information for classification of tiny objects.
  • The proposed method achieves state-of-the-art segmentation performance in both efficiency and accuracy on the Potsdam and Vaihingen benchmark datasets.
The remainder of this paper is organized as follows: Section 2 presents the related work. Section 3 introduces our proposed methodology and the employed modules. Section 4 experimentally validates the proposed method on the Potsdam and Vaihingen datasets. Finally, conclusions are drawn in Section 5.

2. Related Work

2.1. Lightweight Networks

Lightweight networks have become increasingly significant in remote sensing image semantic segmentation due to their ability to maintain high performance while reducing computational costs and memory consumption. As remote sensing images typically involve large datasets and require extensive processing, optimizing network efficiency is crucial for real-time applications.
In remote sensing image semantic segmentation tasks, CNN backbone networks such as EfficientNet [27,28], ResNet50 [29,30], VGG16 [31,32], and MobileNet [33,34] are widely used for multi-scale feature extraction. These networks are designed to strike a balance between accuracy and efficiency by leveraging strategies such as depthwise separable convolutions (MobileNet), residual connections (ResNet), and optimized scaling (EfficientNet). For instance, MobileNet, recognized for its lightweight architecture, employs depthwise separable convolutions to reduce the number of parameters and computational load, without significantly compromising accuracy. Recent studies such as [35] have adopted MobileNetV3 for semantic segmentation tasks in remote sensing, demonstrating that the network’s efficiency enables high-quality segmentations even in computationally constrained environments. ResNet50, with its residual learning framework, facilitates the training of very deep networks by alleviating the vanishing gradient problem, a common issue in deeper CNN architectures. This enables ResNet50 to effectively capture fine-grained features for segmentation tasks, even in the presence of complex scenes. EfficientNet employs a compound scaling method that simultaneously scales up network depth, width, and resolution, thereby improving performance while minimizing computational resources [36]. This optimization is crucial for remote sensing image segmentation, where large-scale input images demand efficient processing techniques. Some models focus on exploring deeper and wider neural networks to handle larger input image sizes. Empirically, deeper networks enhance the capacity for nonlinear interpretation by extracting more abstract and complex features; however, they are more susceptible to issues such as gradient explosion and vanishing gradients [37]. Wider networks capture more diverse information, such as textures in different directions and frequencies [37]. Additionally, larger input image dimensions expand the network’s receptive field, although they increase computational cost and memory consumption [38]. Therefore, balancing network depth, width, and input image size is essential for achieving optimal accuracy and efficiency. Tan and Le (2019) employed neural architecture search to optimize both accuracy and FLOPS in the development of EfficientNet, which consists primarily of Mobile Inverted Bottleneck Convolution (MBConv) blocks [39].
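To make the parameter-efficiency argument concrete, the sketch below contrasts a standard convolution with a MobileNet-style depthwise separable convolution in PyTorch. It is a minimal illustration under assumed layer sizes, not code from any of the works cited above.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """MobileNet-style block: a per-channel (depthwise) 3x3 conv followed by a
    1x1 pointwise conv that mixes channels; far fewer weights than a dense 3x3 conv."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                                   groups=in_ch, bias=False)   # one filter per input channel
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Parameter comparison for a 64 -> 128 channel layer with a 3x3 kernel.
dense = nn.Conv2d(64, 128, 3, padding=1, bias=False)   # 64 * 128 * 9 = 73,728 weights
separable = DepthwiseSeparableConv(64, 128)            # 64 * 9 + 64 * 128 = 8,768 conv weights
print(sum(p.numel() for p in dense.parameters()))
print(sum(p.numel() for p in separable.parameters() if p.dim() > 1))  # conv weights only
```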
Several recent works in remote sensing semantic segmentation have incorporated lightweight modules to reduce computational costs, such as the Ghost module for real-time semantic segmentation of aerial river ice images [40], the integration of a lightweight CNN model (i.e., EfficientNetV2) with a transformer model for tree canopy segmentation [41], and a cross- and self-attention-based lightweight network for building semantic segmentation [33]. The use of lightweight modules is therefore crucial for efficient remote sensing segmentation.

2.2. Multi-Scale Feature Fusion

Low-level features capture detailed information, such as boundaries and locations, but provide less semantic depth. High-level features contain more abstract, discriminative, and target-oriented semantic information, but they lose some spatial resolution [25]. Thus, the fusion of low-level and high-level features strikes a balance between rich semantics with coarse resolution and detailed information with less semantic depth. Multi-scale feature fusion is a vital technique in the semantic segmentation of remote sensing images due to the substantial variations in object size, resolution, and contextual information present in satellite and aerial imagery. Objects of interest, such as buildings, roads, and vegetation, appear at varying scales depending on their geographical context, making multi-scale fusion crucial for accurately capturing and classifying features at different scales. Recent advancements have focused on enhancing the integration of multi-scale features through more efficient architectures, emphasizing both accuracy and computational efficiency.
The conventional Feature Pyramid Network (FPN) [42] is a pioneering method for multi-scale fusion in semantic segmentation. FPN employs a top-down structure to fuse multi-scale features generated through bottom-up feature extraction, propagating from high-level, coarse-resolution semantic features to low-level, high-resolution detailed features through lateral connections. While FPN is effective, its limitation lies in its relatively simplistic fusion strategy, which may fail to capture long-range dependencies in complex scenes. Numerous improvements have been suggested to enhance the FPN structure. For instance, to further enhance the information flow between high- and low-level features, the Path Aggregation Network (PANet) incorporates an additional bottom-up structure [43], improving both segmentation accuracy and small object detection. Additionally, advancements such as BiFPN (Bidirectional Feature Pyramid Network) [44] have introduced efficient cross-scale fusion strategies that facilitate improved multi-scale feature integration. Although these methods have demonstrated considerable performance improvements, their integration into remote sensing image semantic segmentation remains an ongoing area of research, particularly for enhancing the segmentation of small and irregularly shaped objects in complex environments. Another crucial method for multi-scale fusion is Atrous Spatial Pyramid Pooling (ASPP), first introduced in the DeepLab series [45]. ASPP employs atrous convolutions at multiple rates to capture features across different scales without compromising spatial resolution. This enables the network to efficiently integrate contextual information at varying scales, which is particularly advantageous in remote sensing applications where objects of interest span multiple scales. Recent work by Zhao et al. (2023) introduced Dual ASPP [46], which combines different rates of atrous convolutions more effectively, demonstrating its effectiveness in multi-scale road segmentation in complex scenes due to the extraction of both detailed local features and broader contextual information. The success of such hierarchical feature extraction is further validated in [47], where residual atrous spatial pyramid modules (RASPM) enhance multi-scale context capture while preserving spatial details for small objects. The concept of attention mechanisms has recently been incorporated into multi-scale fusion strategies to enhance semantic segmentation in remote sensing. Pyramid Attention Networks (PAN) [48] combine pyramid structures with attention mechanisms, enabling the model to focus on key regions of interest across multiple scales. This facilitates more precise feature fusion across various scales by weighting features according to their relevance for segmentation tasks. A further refinement, Attention-based Multi-scale Feature Fusion (AMFF) [49], enhances PAN by incorporating multi-level attention mechanisms across different feature scales. Building on these principles, ref. [50] introduces a modality-aware dynamic aggregation module (MDAM) that dynamically integrates saliency-related cues from parallel streams, demonstrating how attention mechanisms can be tailored to handle cross-modal dependencies in complex environments.
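As a concrete reference for the ASPP idea discussed above, the sketch below shows a minimal atrous spatial pyramid pooling block in PyTorch. The dilation rates (1, 6, 12, 18) follow a common DeepLab-style configuration, but the module is a simplified illustration rather than the exact implementation of any cited method.

```python
import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates capture context at
    several receptive-field sizes without reducing spatial resolution; the branch
    outputs are concatenated and projected back with a 1x1 convolution."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

aspp = SimpleASPP(256, 128)
y = aspp(torch.randn(1, 256, 32, 32))   # output keeps the 32x32 resolution
print(y.shape)                           # torch.Size([1, 128, 32, 32])
```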
In summary, the complementary advantages of rich semantics at coarse resolution and detailed information at lower semantic levels facilitate the extraction of more comprehensive and identifiable features. Building upon this, we will further explore the potential of multi-scale feature fusion in the semantic segmentation of remote sensing images.

3. Materials and Methods

3.1. Overview

As shown in Figure 1, the proposed network architecture comprises three main components: (1) the EfficientNetV2-s backbone for feature extraction; (2) the Nested Cross-Scale and Bidirectional Feature Fusion (NCSBFF) module; and (3) the segmentation head. In the first stage, a CNN-based lightweight feature pyramid module is used to extract contextual information across multiple scales. The second component involves the NCSBFF module, which fuses multi-scale features. Finally, a lightweight CNN is employed to resize the fused features to the original image dimensions and generate a pixel-wise land cover classification map.

3.2. Lightweight Feature Pyramid Module

EfficientNet, developed by [36], achieves a strong balance of accuracy and efficiency by jointly scaling network depth, width, and input image resolution. It primarily consists of Mobile Inverted Bottleneck Convolution (MBConv) blocks [39]. As illustrated in Figure 2, an MBConv block consists of a 1 × 1 convolution to expand the input’s channel dimension, a depthwise convolution to extract spatial features, a Squeeze-and-Excitation (SE) layer to reweight feature channels, and another 1 × 1 convolution to project the features to a lower dimension. The Fused-MBConv block (Figure 2) replaces the expansion and depthwise convolutions with a single standard 3 × 3 convolution; although depthwise convolutions have fewer parameters and FLOPs, they often cannot fully exploit modern accelerators, so the fused block improves hardware efficiency in the early stages. In EfficientNetV2, MBConv and Fused-MBConv are combined to optimize training speed without substantially increasing the number of parameters. Specifically, Tan and Le (2021) proposed EfficientNetV2_S [51], which incorporates Fused-MBConv in the early layers, uses smaller expansion ratios for MBConv, adopts smaller kernel sizes for convolutions, and removes the final stride-1 stage. To alleviate the computational burden and ensure a better trade-off between accuracy and efficiency, EfficientNetV2_S, the most lightweight model in the EfficientNetV2 family, was selected as the lightweight feature pyramid module in this study.
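The sketch below illustrates the two block types in simplified PyTorch form. BatchNorm/activation placement and the SE reduction ratio are assumptions, and the SE layer is omitted from the fused variant for brevity; it conveys the structure in Figure 2 rather than reproducing the official EfficientNetV2 implementation.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(ch, ch // reduction, 1), nn.SiLU(),
                                nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.fc(x)               # reweight channels

class MBConv(nn.Module):
    """1x1 expand -> 3x3 depthwise -> SE -> 1x1 project, with a residual when shapes match."""
    def __init__(self, in_ch, out_ch, stride=1, expand=4):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU(),
            SqueezeExcite(mid),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))
        self.use_res = stride == 1 and in_ch == out_ch
    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y

class FusedMBConv(nn.Module):
    """The expansion and depthwise convolutions are fused into one regular 3x3 convolution."""
    def __init__(self, in_ch, out_ch, stride=1, expand=4):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))
        self.use_res = stride == 1 and in_ch == out_ch
    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y
```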
The EfficientNetV2_S backbone begins with a stem layer, consisting of a 3 × 3 convolutional layer for preliminary feature extraction. It then employs a sequence of Fused-MBConv modules in the shallow layers (Stages 1–3) and MBConv modules in the deeper layers (Stages 4–6) for multi-scale feature extraction, as shown in Table 1. The hierarchical architecture of EfficientNetV2_S enables the extraction of multi-scale information, which can subsequently be adapted for downstream feature fusion. In this study, the output features from Stages 2, 3, 5, and 6, with dimensions 128 × 128 × 48, 64 × 64 × 64, 32 × 32 × 160, and 16 × 16 × 256, respectively, are selected as the multi-scale features (F1, F2, F3, and F4). These features are selected based on the following considerations. First, the spatial dimensions of feature Fi should be divisible by those of feature Fi+1 to facilitate efficient upsampling or downsampling across different scales. Second, for features with the same spatial dimensions, such as those in Stages 1 and 2 (128 × 128) or Stages 4 and 5 (32 × 32), features extracted from deeper layers are preferred due to their ability to capture higher-level information.
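For readers who want to reproduce this multi-scale feature extraction, the snippet below uses the timm library's feature-extraction mode. The model name and the out_indices that map to Stages 2, 3, 5, and 6 are assumptions and should be verified against model.feature_info for the exact variant used.

```python
import timm
import torch

# features_only=True returns a list of intermediate feature maps instead of logits.
# For an EfficientNetV2-S variant, indices 1-4 typically correspond to strides 4, 8, 16, 32
# with 48, 64, 160, and 256 channels, matching F1-F4 above (verify via model.feature_info).
backbone = timm.create_model(
    "tf_efficientnetv2_s", pretrained=False, features_only=True, out_indices=(1, 2, 3, 4)
)
print(backbone.feature_info.channels(), backbone.feature_info.reduction())

x = torch.randn(1, 3, 512, 512)
f1, f2, f3, f4 = backbone(x)
print(f1.shape, f2.shape, f3.shape, f4.shape)
# expected (under the assumptions above): 128x128x48, 64x64x64, 32x32x160, 16x16x256
```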

3.3. Nested Cross-Scale and Bidirectional Feature Fusion Module

To enhance the discriminative power of the features, we introduce a nested cross-scale and bidirectional feature fusion module designed to propagate semantic information across different levels, namely, F_1 ∈ ℝ^{128×128×48}, F_2 ∈ ℝ^{64×64×64}, F_3 ∈ ℝ^{32×32×160}, and F_4 ∈ ℝ^{16×16×256}. First, to facilitate the direct addition of multi-scale features, we employ four 1 × 1 convolutions to adjust each feature map to the same channel dimension, setting the number of kernels to 128. This value was chosen because it is close to the average number of channels across the four feature maps. The adjusted features can then be fused directly, thereby enriching both detailed and semantic information. Since the information flow between high- and low-level features is more effective in a two-way, top-down and bottom-up structure, a structure similar to PANet is adopted in this paper. Additionally, a skip connection between the original feature and the fused feature in the bottom-up structure is added to the proposed NCSBFF module, inspired by the skip connections in U-shaped networks. Generally, in a U-shaped network, skip connections are designed between the encoder and decoder to propagate information lost during the down-sampling operation, thereby aiding the recovery of fine-grained information [52]. The original features F_i are propagated at the same scale through sequential upsampling in the top-down structure, downsampling in the bottom-up structure, and skip connections, compensating for the loss of spatial positional information in the fused features. Let F_fuse_i denote the fused output at the ith scale; it is formulated as follows:
$$
F'_i =
\begin{cases}
\mathrm{Conv}(F_i), & i = 4,\\
\mathrm{Conv}\big(F_i + \mathrm{Upsampling}(F'_{i+1})\big), & i = 2, 3,\\
\mathrm{Conv}\big(\mathrm{SA}(F_i) + \mathrm{Upsampling}(F'_{i+1})\big), & i = 1.
\end{cases}
$$
$$
F\_fuse_i = \mathrm{Conv}\big(F'_i + F_i + \mathrm{Downsampling}(F\_fuse_{i-1})\big)
$$
where F'_i denotes the output node in the top-down structure, which takes the original feature F_i and the upsampled F'_{i+1} as input, with each node representing a depthwise separable convolution block. Subsequently, the same-scale node in the bottom-up structure takes three features as inputs: F'_i, F_i, and the downsampled F_fuse_{i-1}. In this way, through the element-wise addition of F_i in the bottom-up structure, detailed spatial information is transmitted to the fused features, compensating for the loss of positional information in segmentation tasks. When i = 1, F_1 is first passed to a Shuffle Attention (SA) block [22]. SA is selected because it not only combines channel attention and spatial attention but also has low model complexity. Since F_1 contains more detailed information than the other three features, the SA block is employed to adaptively filter out weak details and highlight critical information for classification. First, F_1 ∈ ℝ^{128×128×48} is divided into four groups of sub-feature maps, F_sub_k ∈ ℝ^{128×128×12}, and each sub-feature map is further split along the channel dimension into two maps, F_sub_k1 ∈ ℝ^{128×128×6} and F_sub_k2 ∈ ℝ^{128×128×6}, which are passed through two branches. These two branches produce the channel attention and spatial attention maps, respectively, as follows:
$$
F'_{sub\_k1} = \varphi\big(f_c(f_{gap}(F_{sub\_k1}))\big) \cdot F_{sub\_k1}
$$
$$
F'_{sub\_k2} = \varphi\big(f_c(\mathrm{GN}(F_{sub\_k2}))\big) \cdot F_{sub\_k2}
$$
where f_gap and GN denote the global average pooling and group normalization operations, respectively. A scaling transformation f_c and a sigmoid activation function φ are then applied to compute the attention maps. Finally, an element-wise multiplication between each sub-feature and its attention map yields the outputs F'_sub_k1 and F'_sub_k2 of the channel and spatial attention branches. The outputs of the two branches are concatenated, and all sub-features are aggregated, followed by a channel shuffle operation, to obtain the output of the SA block.
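To make Equations (1) and (2) and the SA block concrete, the following sketch implements the fusion logic in simplified PyTorch. Channel counts, the number of SA groups, nearest-neighbour upsampling, strided convolutions for downsampling, and the placement of SA after the channel-unifying 1 × 1 convolution are assumptions where the paper does not pin them down, so this should be read as a schematic rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ds_block(ch):
    """Depthwise separable convolution block used for each fusion node."""
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False),
        nn.Conv2d(ch, ch, 1, bias=False),
        nn.BatchNorm2d(ch),
        nn.ReLU(inplace=True),
    )

class ShuffleAttention(nn.Module):
    """Simplified SA block: each group is split into a channel-attention half and a
    spatial-attention half, reweighted (Eqs. (3)-(4)), and concatenated again."""
    def __init__(self, channels, groups=4):
        super().__init__()
        self.groups = groups
        half = channels // (2 * groups)
        self.cw = nn.Parameter(torch.ones(1, half, 1, 1))     # f_c modeled as a learned scale
        self.cb = nn.Parameter(torch.zeros(1, half, 1, 1))    # and shift
        self.sw = nn.Parameter(torch.ones(1, half, 1, 1))
        self.sb = nn.Parameter(torch.zeros(1, half, 1, 1))
        self.gn = nn.GroupNorm(half, half)

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.view(b * self.groups, c // self.groups, h, w)
        x1, x2 = x.chunk(2, dim=1)
        x1 = torch.sigmoid(self.cw * F.adaptive_avg_pool2d(x1, 1) + self.cb) * x1  # channel branch
        x2 = torch.sigmoid(self.sw * self.gn(x2) + self.sb) * x2                   # spatial branch
        return torch.cat([x1, x2], dim=1).view(b, c, h, w)

class NCSBFF(nn.Module):
    """Top-down path produces F'_i (Eq. (1)); the bottom-up path fuses F'_i, the original
    F_i, and the downsampled previous fused map (Eq. (2)). SA acts on the finest level."""
    def __init__(self, in_channels=(48, 64, 160, 256), ch=128):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, ch, 1) for c in in_channels])
        self.td = nn.ModuleList([ds_block(ch) for _ in range(4)])
        self.bu = nn.ModuleList([ds_block(ch) for _ in range(4)])
        self.down = nn.ModuleList([nn.Conv2d(ch, ch, 3, stride=2, padding=1) for _ in range(3)])
        # SA is applied after the channel-unifying 1x1 conv here for shape consistency;
        # the paper describes it on the raw F_1 (48 channels).
        self.sa = ShuffleAttention(ch)

    def forward(self, feats):                                  # [F1, F2, F3, F4], fine -> coarse
        f = [lat(x) for lat, x in zip(self.lateral, feats)]
        fp = [None] * 4
        fp[3] = self.td[3](f[3])                               # Eq. (1), i = 4
        for i in (2, 1):                                       # Eq. (1), i = 3, 2
            fp[i] = self.td[i](f[i] + F.interpolate(fp[i + 1], scale_factor=2, mode="nearest"))
        fp[0] = self.td[0](self.sa(f[0]) + F.interpolate(fp[1], scale_factor=2, mode="nearest"))
        fuse = [self.bu[0](fp[0] + f[0])]                      # base case: no coarser fused map yet
        for i in (1, 2, 3):                                    # Eq. (2)
            fuse.append(self.bu[i](fp[i] + f[i] + self.down[i - 1](fuse[-1])))
        return fuse                                            # [F_fuse_1, ..., F_fuse_4]

module = NCSBFF()
feats = [torch.randn(1, c, s, s) for c, s in zip((48, 64, 160, 256), (128, 64, 32, 16))]
print([t.shape for t in module(feats)])   # four 128-channel maps at 128, 64, 32, 16
```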

3.4. Segmentation Head

We obtain the fused features F_fuse_i (i = 1, 2, 3, 4) from the NCSBFF module, which contain both low-level spatial details and high-level semantic information and share the same channel dimensionality (128 channels) but differ in spatial resolution. To enable fusion through element-wise summation with few additional parameters, each feature map is bilinearly upsampled with progressive scaling factors of 8, 4, 2, and 1, respectively, and the upsampled maps are added element-wise to obtain the final fused features. The fused representation is then processed by a 3 × 3 convolutional layer for cross-scale integration and pixel-wise prediction, yielding a segmentation map at 1/4 of the original resolution. Finally, a four-times bilinear upsampling operation restores the original image resolution, producing the semantic segmentation map.
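A minimal sketch of the segmentation head described above, assuming a 512 × 512 input so that the fused maps have resolutions 128, 64, 32, and 16; the bilinear interpolation settings are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    """Upsample F_fuse_1..F_fuse_4 to the finest (1/4) resolution, sum them element-wise,
    apply a 3x3 convolution, then upsample 4x to the original image resolution."""
    def __init__(self, ch=128, num_classes=6):
        super().__init__()
        self.classifier = nn.Conv2d(ch, num_classes, 3, padding=1)

    def forward(self, fused):                        # fused = [F_fuse_1, ..., F_fuse_4]
        target = fused[0].shape[-2:]                  # highest-resolution map (1/4 of the input)
        summed = sum(F.interpolate(f, size=target, mode="bilinear", align_corners=False)
                     for f in fused)
        logits = self.classifier(summed)              # pixel-wise class scores at 1/4 resolution
        return F.interpolate(logits, scale_factor=4, mode="bilinear", align_corners=False)

head = SegHead()
fused = [torch.randn(1, 128, s, s) for s in (128, 64, 32, 16)]
print(head(fused).shape)                              # torch.Size([1, 6, 512, 512])
```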

4. Experiments and Results

4.1. Experimental Settings and Evaluation Metrics

The experiments were carried out on a single NVIDIA GeForce 4090 GPU (12 GB). Our model was optimized using the Adam optimizer with an initial learning rate of 0.0005, a weight decay of 1 × 10⁻⁸, and betas of (0.9, 0.999). Additionally, a step learning rate strategy was applied, halving the learning rate every 20 epochs. We employed the standard cross-entropy loss function for training, with a batch size of 8. Furthermore, three data augmentation techniques, i.e., random horizontal flipping, random vertical flipping, and random 90-degree rotation, were applied during training. The evaluation metrics include precision, recall, average F1 score (Ave.F1), mean intersection over union (mIoU), and overall accuracy (OA). For each category, IoU is defined as the ratio of the intersection to the union of the prediction and the ground truth, and the F1 score is the harmonic mean of precision and recall, calculated as follows:
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
$$\mathrm{precision} = \frac{TP}{TP + FP}$$
$$\mathrm{recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
Here, mIoU and Ave.F1 represent the average values of IoU and F1 across all categories. Furthermore, we also use the number of floating-point operations (FLOPs) and the number of model parameters (Params) to measure the computational cost and memory consumption of different models.
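The training configuration described above translates directly into a few lines of PyTorch. The sketch below assumes the albumentations library for the three augmentations and uses a stand-in model and synthetic batches so it runs on its own; these stand-ins are not part of the original paper.

```python
import torch
import torch.nn as nn
import albumentations as A

# Augmentations listed above (assuming the albumentations library); in practice these
# would be applied inside the dataset pipeline.
train_aug = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
])

model = nn.Conv2d(3, 6, 3, padding=1)       # stand-in for NCSBFF-Net (6 classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-8,
                             betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)  # halve every 20 epochs

for epoch in range(2):                                   # shortened loop for illustration
    for _ in range(4):                                   # stand-in for the training dataloader
        images = torch.randn(8, 3, 512, 512)             # batch size 8
        masks = torch.randint(0, 6, (8, 512, 512))
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
    scheduler.step()
```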

4.2. Datasets

Our nested cross-scale and bidirectional feature fusion network is evaluated on two standard remote sensing image datasets, i.e., the Potsdam and Vaihingen datasets. Potsdam, a typical historic city, is characterized by large buildings and narrow streets, while Vaihingen is a relatively small town with independent, small-sized buildings. Therefore, the characteristics of the two datasets are quite distinct.
  • Potsdam Dataset [53]: The Potsdam dataset contains 38 high-resolution true orthophoto (TOP) images, each of size 6000 × 6000 pixels, covering Potsdam City, Germany, with a ground sampling distance of 5 cm. Each image is available in three channel combinations, namely, R-G-B, R-G-B-IR, and IR-R-G. In addition, the dataset offers two types of annotations, with and without eroded boundaries. To avoid ambiguity in labeling boundaries, all experimental results are obtained and benchmarked on the eroded-boundary annotations. Following the experimental setup in [54], we use 24 R-G-B images for training and 14 R-G-B images for testing (image IDs: 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15, and 7_13). The dataset consists of six categories: impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background. Each image is overlap-partitioned into sub-images of size 512 × 512 with a stride of 256 pixels.
  • Vaihingen Dataset [53]: The Vaihingen dataset contains 33 TOP images collected by advanced airborne sensors, covering a 1.38 km² area of Vaihingen, with a ground sampling distance of about 9 cm. Each TOP image has IR, R, and G channels. The dataset has the same six categories as the Potsdam dataset. Following [55], we select 11 images for training (image IDs: 1, 3, 5, 7, 13, 17, 21, 23, 26, 32, and 37) and 5 images for testing (image IDs: 11, 15, 28, 30, and 34). Each image is overlap-partitioned into sub-images of size 512 × 512 with a stride of 256 pixels (a minimal tiling sketch is given after this list).
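A minimal sketch of the overlap partitioning used for both datasets (512 × 512 patches with a 256-pixel stride); the handling of border tiles, which the paper does not specify, is an assumption.

```python
import numpy as np

def tile_image(image: np.ndarray, tile: int = 512, stride: int = 256):
    """Cut an H x W x C image into overlapping tile x tile patches with the given stride.
    Border tiles are shifted inward so every patch stays inside the image (an assumption)."""
    h, w = image.shape[:2]
    ys = list(range(0, h - tile + 1, stride))
    xs = list(range(0, w - tile + 1, stride))
    if ys[-1] != h - tile:
        ys.append(h - tile)                 # make sure the bottom edge is covered
    if xs[-1] != w - tile:
        xs.append(w - tile)                 # make sure the right edge is covered
    return [image[y:y + tile, x:x + tile] for y in ys for x in xs]

patches = tile_image(np.zeros((6000, 6000, 3), dtype=np.uint8))
print(len(patches))   # 23 * 23 = 529 patches for one 6000 x 6000 Potsdam tile
```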

4.3. Comparison with State-of-the-Art Methods

We compare the performance of our proposed method against six state-of-the-art methods on the Potsdam and Vaihingen datasets, including DeepLabV3+ [56], PANet [43], Linknet [57], Multi-Scale Attention Network (MAnet) [58], Pyramid Scene Parsing Network (PSPNet) [23], and Fully Convolutional Network (FCN) [15]. ResNet50 was selected as the backbone architecture for the comparative methods to extract multi-level features.
  • DeepLabV3+: The fourth-generation version of Google’s DeepLab series, enhanced with the atrous spatial pyramid pooling (ASPP) module. ASPP, with varying dilation rates, extracts features from different scales in parallel within the encoder. The decoder restores the image resolution through two bilinear interpolation stages.
  • PANet: A network that features an integrated top-down and bottom-up structure, enabling two-way interaction between high-resolution low-level features and coarse-resolution high-level features.
  • Linknet: A U-shaped architecture in which encoder features generated by the backbone are directly linked through deconvolution and element-wise addition operations.
  • MAnet: MANet combines position-wise attention blocks and multi-scale fusion attention blocks to adaptively integrate local features with their global dependencies based on the attention mechanism.
  • PSPNet: PSPNet incorporates a pyramid pooling module (PPM) to aggregate contextual information. The PPM generates representations of different sub-regions. These features are subsequently upsampled and concatenated, allowing the model to integrate both local and global features.
  • FCN: FCN is a classical CNN-based semantic segmentation model. In the decoder, fully connected layers are replaced with deconvolutional or upsampling layers and feature maps are restored to the image’s input size to achieve pixel-by-pixel predictions.
The accuracy on the Potsdam and Vaihingen datasets is presented in Table 2 and Table 3, respectively. Since the clutter/background class can consist of anything except impervious surfaces, buildings, low vegetation, trees, and cars, it is the most challenging class to identify. In line with previous work [59], the proposed method was benchmarked using mIoU, OA, Ave.F1, and the F1 score per category. On the Potsdam dataset, among the six comparative methods, DeepLabV3+ achieved the best mIoU and Ave.F1, while FCN performed best in OA. The proposed method achieved the best scores on all three overall metrics, with improvements of 1.14% in mIoU, 0.6% in OA, and 0.82% in Ave.F1 compared to DeepLabV3+. For the F1 score per category, the proposed method and DeepLabV3+ ranked first and second across all class types except cars. For the relatively challenging small-scale target, the car class, FCN achieved the best result, with an F1 score of 96.14%, followed by the proposed method with 95.63%. Regarding the most challenging class type, i.e., clutter, it had the lowest F1 score among all categories, remaining below 60% for all comparative methods. However, our proposed method reached 61.74%, a significant increase of 2.19% compared to DeepLabV3+.
Furthermore, to visually demonstrate the distinctions between our proposed method and the comparative methods, segmentation results from five representative regions are compared in Figure 3. As anticipated, our proposed method yields superior segmentation results compared to the other methods. As seen in Regions #1, #2, and #5, the proposed method successfully segments the complete outline of a large building without misclassified pixels in the inner regions. In Regions #3 and #4, for smaller clutter and small-patch low vegetation, our proposed method also segments them more accurately. For instance, in Region #1, buildings are correctly segmented only by NCSBFF-Net, while other comparative methods misclassify them as impervious surfaces and low vegetation. In Region #2, the bright building roofs are segmented without misclassification errors, with relatively accurate boundaries, by our proposed method. In Region #3, only FCN and NCSBFF-Net can better delineate the shapes of clutter objects, particularly in areas sheltered by trees. In Region #4, clutter objects with tiny linear shapes and large circular structures are better delineated by the proposed NCSBFF-Net, highlighting its ability to capture multi-scale information. In Region #5, the geometry of the buildings and the edges of low vegetation are better preserved in the proposed method compared to the others. Therefore, for objects of varying scales, the proposed method achieves more accurate segmentation results with fewer misclassifications and better-delineated boundaries. The proposed approach, however, remains susceptible to minor misclassification errors in small-target categories (particularly cars), stemming from two principal confounding factors: (i) partial occlusion induced by tree shadows; (ii) geometric distortions coupled with insufficiently distinct edge features at image boundaries, potentially due to boundary blurring or inadequate feature extraction. Representative error patterns are systematically documented in Figure 4.
On the Vaihingen dataset, our proposed method achieves 73.66% mIoU and 83.56% Ave.F1, outperforming the other methods. Compared with DeepLabV3+, mIoU increases by 3.36% and Ave.F1 by 3.13%. For the overall evaluation metric OA, our method achieves a score of 90.72%, second only to MAnet. In addition, for the most challenging class, clutter, our approach achieves a score of 55.71%, outperforming the second-ranked PANet by 6.02%. For buildings, which exhibit varied scales, our method attains an F1 score of 95.58%. To visually highlight the distinctions between our proposed method and the other approaches, we selected five representative regions in the Vaihingen dataset; their segmentation results for each approach are presented in Figure 5. Compared to the other methods, our proposed method produces segmentation maps that are closest to the ground truth.

4.4. Ablation Study

In this section, we conduct extensive ablation experiments on the proposed model using the two datasets. Specifically, we deconstruct our method into distinct combinations of its components (the NCSBFF module and the SA block) and evaluate their impact. In the baseline method, we use the EfficientNetV2-s backbone as the encoder, and the decoder is implemented through simple upsampling and summation operations. EfficientNetV2-s achieves a good balance among depth, width, and resolution through its unified compound scaling strategy, making it well suited to handling multi-scale objects. Additionally, it is highly compatible with the proposed Nested Cross-Scale and Bidirectional Feature Fusion (NCSBFF) module and the Shuffle Attention (SA) mechanism, which further enhance the model’s performance. As shown in Table 4, for the Potsdam dataset, the baseline method attains the second-best OA at 92.65%; however, it yields the lowest values for mIoU and Ave.F1. For the Vaihingen dataset, the baseline method’s performance is the lowest on all three metrics. After introducing the NCSBFF module and SA block, mIoU and Ave.F1 consistently improve, except for OA on the Potsdam dataset, demonstrating the effectiveness of each component of our model. Specifically, the two-way fusion of multi-scale features by the NCSBFF module brings gains in accuracy. After further introducing the SA block into the NCSBFF module, more significant improvements in both mIoU and Ave.F1 are observed for the two datasets relative to the baseline, i.e., 1.02% mIoU and 0.86% Ave.F1 for the Potsdam dataset and 2.44% mIoU and 2.25% Ave.F1 for the Vaihingen dataset. This indicates the effectiveness of the SA block in filtering out weak details and highlighting critical information in high-resolution low-level features before passing them into the multi-scale feature fusion. The SA module enhances the representation of small objects and fine-grained details through the shuffled attention mechanism, thereby improving the mIoU and F1 scores. However, due to the relatively small pixel proportion of small objects, their contribution to OA is limited, so OA remains essentially unchanged. In summary, the integration of the SA block and NCSBFF module improves the performance of the model.
Several sets of segmentation results for each component of our proposed method are presented in Figure 6. Although the baseline network achieves reasonable segmentation results, it struggles with several semantically ambiguous regions, such as roof superstructures (Case #1), buildings covered with fallen leaves (Case #2), impervious surfaces with a color similar to building roofs (Case #3), and regions shaded by trees or cast in shadow. In addition, the baseline network has difficulty accurately delineating the complex contours of objects. When the NCSBFF module is introduced, the multi-scale feature fusion ability improves, enhancing the model’s performance in identifying objects of various scales. When the SA block is further integrated into the NCSBFF module, edges are captured more accurately, and different regions with similar pixel values can be distinguished more effectively. The enhanced multi-scale fusion capability improves small-object recognition accuracy through cross-level feature complementarity. For instance, objects such as vehicles may initially be detected as edge points in shallow-layer features, yet they require deep-layer features to confirm their identity as ‘vehicles’ rather than noise. The Shuffle Attention (SA) module further enhances this process by directing the model’s attention to the regular geometric shapes of small-scale rooftops (via high attention weights) while suppressing irrelevant surrounding textures such as cluttered vegetation. This synergistic mechanism, leveraging multi-scale context refinement and spatially adaptive attention, effectively elevates the precision of small-object recognition in complex remote sensing scenarios. As illustrated in Figure 7, our model outperforms the baselines in recognizing small objects: it accurately identifies and localizes small objects, such as vehicles and rooftops, while effectively suppressing noise and irrelevant textures, further validating the effectiveness of the proposed multi-scale fusion and attention mechanisms.

4.5. Multi-Scale Image Impact

When the input image size was set to 384 × 384, the mIoU, OA, and F1 metrics exhibited varying degrees of decline compared to the 512 × 512 configuration. In remote sensing segmentation tasks, the significant scale variations of ground objects critically affect the network’s capacity to harmonize contextual information and local details. Smaller input sizes, despite retaining high-frequency fine details, are hindered by restricted receptive fields, which degrade the representation of long-range spatial correlations. Consequently, segmentation accuracy is lower than with larger input sizes, as evidenced by our experiments on the Potsdam dataset, where the reduced input resolution consistently underperformed in holistic context modeling. Detailed quantitative comparisons are provided in Table 5.

4.6. Model Computational Complexity

To evaluate the balance between the accuracy and efficiency of the proposed and comparative methods, the mIoU, number of parameters (Params), floating-point operations (FLOPs), and inference time are reported in Table 6. In terms of mIoU, our proposed model achieves improvements of 5.01% and 5.31% over PSPNet (the method with the lowest computational complexity) on the Potsdam and Vaihingen datasets, respectively, demonstrating a favorable balance between performance and efficiency. On the Potsdam dataset, DeepLabV3+ attains the second-highest mIoU, whereas our model achieves the highest mIoU while reducing the parameter count by 24.28% and the computational cost by 43.99%. Similarly, on the Vaihingen dataset, our model outperforms FCN (the second-best performer in mIoU) with a 38.69% reduction in parameters and an 85.1% reduction in computational load.
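For reference, complexity figures like those in Table 6 can be reproduced approximately with a profiling tool such as thop; the choice of profiler is an assumption, since the authors do not state which tool they used. Note that thop reports multiply-accumulate operations, which are often quoted as FLOPs, and that it counts parameters rather than reporting megabytes.

```python
import torch
import torch.nn as nn
from thop import profile   # pip install thop

model = nn.Sequential(                      # stand-in model; substitute NCSBFF-Net here
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 6, 1),
)
dummy = torch.randn(1, 3, 224, 224)         # Table 6 uses a 224 x 224 input
macs, params = profile(model, inputs=(dummy,))
print(f"MACs: {macs / 1e9:.2f} G, Params: {params / 1e6:.2f} M")
```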

5. Conclusions

In this paper, we propose a novel lightweight network, NCSBFF-Net, for the semantic segmentation of high-resolution remote sensing images. With efficient feature extraction using EfficientNetV2 and feature fusion through the combined NCSBFF module and SA block, NCSBFF-Net simultaneously addresses intra-class heterogeneity, inter-class homogeneity, and the presence of tiny objects. Extensive experiments confirm that the proposed NCSBFF-Net achieves superior semantic segmentation performance on the Potsdam and Vaihingen benchmarks. An extensive ablation study was conducted to evaluate the impact of the individual components of the proposed method. Compared to the most lightweight comparative method, our proposed method achieves a more than 5% improvement in mIoU at roughly 1.73 times the computational cost. However, our method still requires sufficient sample support. Future work may consider pseudo-labeling approaches, as suggested in [50].

Author Contributions

Conceptualization, S.Z.; Methodology, Y.T.; Software, S.Z. and D.W.; Validation, B.Z.; Formal analysis, S.Z., B.Z. and D.W.; Investigation, B.Z. and Y.T.; Resources, S.Z.; Data curation, D.W.; Writing—original draft, S.Z. and D.W.; Writing—review & editing, B.Z. and D.W.; Visualization, S.Z.; Supervision, B.Z., D.W. and Y.T.; Project administration, D.W. and Y.T.; Funding acquisition, Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 41901279 and in part by the Science Foundation Research Project of the Wuhan Institute of Technology, China, under Grant K202239.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Duo, L.; Wang, J.; Zhang, F.; Xia, Y.; Xiao, S.; He, B.-J. Assessing the spatiotemporal evolution and drivers of ecological environment quality using an enhanced remote sensing ecological index in Lanzhou City, China. Remote Sens. 2023, 15, 4704. [Google Scholar] [CrossRef]
  2. Jiang, D.; Jones, I.; Liu, X.; Simis, S.G.; Cretaux, J.-F.; Albergel, C.; Tyler, A.; Spyrakos, E. Impacts of droughts and human activities on water quantity and quality: Remote sensing observations of Lake Qadisiyah, Iraq. Int. J. Appl. Earth Obs. Geoinf. 2024, 132, 104021. [Google Scholar]
  3. Ding, L.; Zhang, J.; Bruzzone, L. Semantic segmentation of large-size VHR remote sensing images using a two-stage multiscale training architecture. IEEE Trans. Geosci. Remote Sens. 2020, 58, 5367–5376. [Google Scholar] [CrossRef]
  4. Luo, H.; Chen, C.; Fang, L.; Khoshelham, K.; Shen, G. MS-RRFSegNet: Multiscale regional relation feature segmentation network for semantic segmentation of urban scene point clouds. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8301–8315. [Google Scholar]
  5. Zhao, J.; Zhou, Y.; Shi, B.; Yang, J.; Zhang, D.; Yao, R. Multi-stage fusion and multi-source attention network for multi-modal remote sensing image segmentation. ACM Trans. Intell. Syst. Technol. 2021, 12, 1–20. [Google Scholar]
  6. Liu, G.; Li, L.; Jiao, L.; Dong, Y.; Li, X. Stacked Fisher autoencoder for SAR change detection. Pattern Recognit. 2019, 96, 106971. [Google Scholar] [CrossRef]
  7. Sahar, L.; Muthukumar, S.; French, S.P. Using aerial imagery and GIS in automated building footprint extraction and shape recognition for earthquake risk assessment of urban inventories. IEEE Trans. Geosci. Remote Sens. 2010, 48, 3511–3520. [Google Scholar]
  8. Sheikh, R.; Milioto, A.; Lottes, P.; Stachniss, C.; Bennewitz, M.; Schultz, T. Gradient and log-based active learning for semantic segmentation of crop and weed for agricultural robots. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1350–1356. [Google Scholar]
  9. Yu, Y.; Bao, Y.; Wang, J.; Chu, H.; Zhao, N.; He, Y.; Liu, Y. Crop row segmentation and detection in paddy fields based on treble-classification Otsu and double-dimensional clustering method. Remote Sens. 2021, 13, 901. [Google Scholar] [CrossRef]
  10. Wu, Y.; Yang, X.; Plaza, A.; Qiao, F.; Gao, L.; Zhang, B.; Cui, Y. Approximate computing of remotely sensed data: SVM hyperspectral image classification as a case study. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 5806–5818. [Google Scholar]
  11. Han, W.; Zhang, X.; Wang, Y.; Wang, L.; Huang, X.; Li, J.; Wang, S.; Chen, W.; Li, X.; Feng, R.; et al. A survey of machine learning and deep learning in remote sensing of geological environment: Challenges, advances, and opportunities. ISPRS J. Photogramm. Remote Sens. 2023, 202, 87–113. [Google Scholar]
  12. Yi, Y.; Zhang, Z.; Zhang, W.; Zhang, C.; Li, W.; Zhao, T. Semantic segmentation of urban buildings from VHR remote sensing imagery using a deep convolutional neural network. Remote Sens. 2019, 11, 1774. [Google Scholar] [CrossRef]
  13. Yu, B.; Yang, L.; Chen, F. Semantic segmentation for high spatial resolution remote sensing images based on convolution neural network and pyramid pooling module. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 3252–3261. [Google Scholar]
  14. Zheng, C.; Hu, C.; Chen, Y.; Li, J. A self-learning-update CNN model for semantic segmentation of remote sensing images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar]
  15. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 3431–3440. [Google Scholar]
  16. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  17. Huang, L.; Jiang, B.; Lv, S.; Liu, Y.; Fu, Y. Deep learning-based semantic segmentation of remote sensing images: A survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 8370–8396. [Google Scholar]
  18. Lin, G.; Liu, F.; Milan, A.; Shen, C.; Reid, I. RefineNet: Multi-path refinement networks for dense prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 1228–1242. [Google Scholar]
  19. Wu, Z.; Liu, C.; Song, B.; Pei, H.; Li, P.; Chen, M. Diff-HRNet: A diffusion model-based high-resolution network for remote sensing semantic segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 22, 6000505. [Google Scholar]
  20. Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-A: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114. [Google Scholar]
  21. Li, R.; Zheng, S.; Duan, C.; Su, J.; Zhang, C. Multistage attention ResU-Net for semantic segmentation of fine-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar]
  22. Zhang, Q.-L.; Yang, Y.-B. SA-Net: Shuffle attention for deep convolutional neural networks. arXiv 2021, arXiv:2102.00240. [Google Scholar]
  23. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2881–2890. [Google Scholar]
  24. Wang, F.; Piao, S.; Xie, J. CSE-HRNet: A context and semantic enhanced high-resolution network for semantic segmentation of aerial imagery. IEEE Access 2020, 8, 182475–182489. [Google Scholar]
  25. Zuo, R.; Zhang, G.; Zhang, R.; Jia, X. A deformable attention network for high-resolution remote sensing images semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14. [Google Scholar]
  26. Shafique, A.; Cao, G.; Khan, Z.; Asad, M.; Aslam, M. Deep learning-based change detection in remote sensing images: A review. Remote Sens. 2022, 14, 871. [Google Scholar] [CrossRef]
  27. Huo, B.; Li, C.; Zhang, J.; Xue, Y.; Lin, Z. SAFF-SSD: Self-attention combined feature fusion-based SSD for small object detection in remote sensing. Remote Sens. 2023, 15, 3027. [Google Scholar] [CrossRef]
  28. Deng, L.; Wang, Y.; Lan, Q.; Chen, F. Remote sensing image building change detection based on Efficient-UNet++. J. Appl. Remote Sens. 2023, 17, 034501. [Google Scholar]
  29. Feng, J.; Wang, H. A multi-scale contextual attention network for remote sensing visual question answering. Int. J. Appl. Earth Obs. Geoinf. 2024, 126, 103641. [Google Scholar]
  30. Su, Y.; Cheng, J.; Wang, W.; Bai, H.; Liu, H. Semantic segmentation for high-resolution remote-sensing images via dynamic graph context reasoning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar]
  31. Chen, S.; Lei, F.; Zang, Z.; Zhang, M. Forest mapping using a VGG16-UNet++ & stacking model based on Google Earth Engine in the urban area. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar]
  32. Su, Y.; Cheng, J.; Bai, H.; Liu, H.; He, C. Semantic segmentation of very-high-resolution remote sensing images via deep multi-feature learning. Remote Sens. 2022, 14, 533. [Google Scholar] [CrossRef]
  33. Li, J.; Hu, Y.; Huang, X. CASAFormer: A cross-and self-attention based lightweight network for large-scale building semantic segmentation. Int. J. Appl. Earth Obs. Geoinf. 2024, 130, 103942. [Google Scholar]
  34. Li, X.; Li, J. MFCA-Net: A deep learning method for semantic segmentation of remote sensing images. Sci. Rep. 2024, 14, 5745. [Google Scholar]
  35. Li, L.; Ding, J.; Cui, H.; Chen, Z.; Liao, G. LITEMS-Net: A lightweight semantic segmentation network with multi-scale feature extraction for urban streetscape scenes. Vis. Comput. 2024, 41, 2801–2815. [Google Scholar] [CrossRef]
  36. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; PMLR: New York, NY, USA, 2019; pp. 6105–6114. [Google Scholar]
  37. Chen, F.; Tsou, J.Y. Assessing the effects of convolutional neural network architectural factors on model performance for remote sensing image classification: An in-depth investigation. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102865. [Google Scholar] [CrossRef]
  38. Wang, B.; Huang, G.; Li, H.; Chen, X.; Zhang, L.; Gao, X. Hybrid CBAM-EfficientNetV2 fire image recognition method with label smoothing in detecting tiny targets. Mach. Intell. Res. 2024, 21, 1145–1161. [Google Scholar] [CrossRef]
  39. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4510–4520. [Google Scholar]
  40. Zhang, X.; Zhao, Z.; Ran, L.; Xing, Y.; Wang, W.; Lan, Z.; Yin, H.; He, H.; Liu, Q.; Zhang, B.; et al. FastICENet: A real-time and accurate semantic segmentation model for aerial remote sensing river ice image. Signal Process. 2023, 212, 109150. [Google Scholar] [CrossRef]
  41. He, H.; Zhou, F.; Xia, Y.; Chen, M.; Chen, T. Parallel fusion neural network considering local and global semantic information for citrus tree canopy segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 1535–1549. [Google Scholar] [CrossRef]
  42. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2117–2125. [Google Scholar]
  43. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 8759–8768. [Google Scholar]
  44. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 10781–10790. [Google Scholar]
  45. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  46. Zhao, S.; Feng, Z.; Chen, L.; Li, G. DANet: A semantic segmentation network for remote sensing of roads based on dual-ASPP structure. Electronics 2023, 12, 3243. [Google Scholar] [CrossRef]
  47. Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-conquer: Confluent triple-flow network for RGB-T salient object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 1958–1974. [Google Scholar] [CrossRef]
  48. Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid attention network for semantic segmentation. arXiv 2018, arXiv:1805.10180. [Google Scholar]
  49. Chi, Y.; Li, J.; Fan, H. Pyramid-attention based multi-scale feature fusion network for multispectral pan-sharpening. Appl. Intell. 2022, 52, 5353–5365. [Google Scholar] [CrossRef]
  50. Zheng, Z.; Ermon, S.; Kim, D.; Zhang, L.; Zhong, Y. ChangeNet2: Multi-temporal remote sensing generative change foundation model. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 725–741. [Google Scholar] [CrossRef] [PubMed]
  51. Tan, M.; Le, Q. EfficientNetV2: Smaller models and faster training. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; PMLR: New York, NY, USA, 2021; pp. 10096–10106. [Google Scholar]
  52. Wang, J.; Wang, B.; Wang, X.; Zhao, Y.; Long, T. Hybrid attention-based U-shaped network for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  53. Rottensteiner, F.; Sohn, G.; Jung, J.; Gerke, M.; Baillard, C.; Benitez, S.; Breitkopf, U. The ISPRS benchmark on urban object classification and 3D building reconstruction. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2012, 1, 293–298. [Google Scholar] [CrossRef]
  54. He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  55. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. High-resolution aerial image labeling with convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 7092–7103. [Google Scholar] [CrossRef]
  56. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 801–818. [Google Scholar]
  57. Chaurasia, A.; Culurciello, E. LinkNet: Exploiting encoder representations for efficient semantic segmentation. In Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–4. [Google Scholar]
  58. Fan, T.; Wang, G.; Li, Y.; Wang, H. MA-Net: A multi-scale attention network for liver and tumor segmentation. IEEE Access 2020, 8, 179656–179665. [Google Scholar] [CrossRef]
  59. Hanyu, T.; Yamazaki, K.; Tran, M.; McCann, R.A.; Liao, H.; Rainwater, C.; Adkins, M.; Cothren, J.; Le, N. AerialFormer: Multi-resolution transformer for aerial image segmentation. Remote Sens. 2024, 16, 2930. [Google Scholar] [CrossRef]
Figure 1. The architecture of the proposed network, including the following: (1) the EfficientNetV2-s backbone for feature extraction; (2) the Nested Cross-Scale and Bidirectional Feature Fusion (NCSBFF) module; and (3) the segmentation head.
Figure 2. The structure of Fused-MBConv and MBConv.
Figure 3. Qualitative comparison of the visualization results of our method with other methods on the Potsdam dataset.
Figure 4. The error patterns depicted here encompass partial occlusion induced by tree shadows and geometric distortions coupled with poorly defined edge features, highlighting challenges in boundary detection.
Figure 5. Qualitative comparison of the visualization results of our method with other methods on the Vaihingen dataset.
Figure 6. Qualitative Visualization of ablation experiments on the Potsdam dataset (top) and Vaihingen dataset (bottom).
Figure 7. Visual results of small object recognition.
Table 1. Network structure of the EfficientNetV2_S backbone.
Stage | Operator | Kernel Size | Stride | Dimensions | #Layers
Stem | Conv | 3 × 3 | 2 | 256 × 256 × 24 | 1
1 | Fused-MBConv | 3 × 3 | 1 | 128 × 128 × 24 | 2
2 | Fused-MBConv | 3 × 3 | 2 | 128 × 128 × 48 | 4
3 | Fused-MBConv | 3 × 3 | 2 | 64 × 64 × 64 | 4
4 | MBConv | 3 × 3 | 2 | 32 × 32 × 128 | 6
5 | MBConv | 3 × 3 | 1 | 32 × 32 × 160 | 9
6 | MBConv | 3 × 3 | 2 | 16 × 16 × 256 | 15
Table 2. Quantitative performance comparison with state-of-the-art methods on the Potsdam dataset.
Method | mIoU | OA | Ave. F1 | Imp. Surf. (F1) | Buildings (F1) | Low Veg. (F1) | Trees (F1) | Cars (F1) | Clutter (F1)
DeepLabV3+ | 78.10 | 92.05 | 86.50 | 92.91 | 96.41 | 86.56 | 88.26 | 95.32 | 59.55
PAN | 76.94 | 91.77 | 85.61 | 92.53 | 96.15 | 85.67 | 87.66 | 95.00 | 56.66
Linknet | 77.28 | 92.05 | 85.85 | 92.45 | 96.29 | 85.56 | 87.78 | 94.93 | 57.12
MAnet | 76.15 | 91.84 | 84.88 | 92.51 | 96.05 | 85.62 | 87.78 | 94.43 | 52.87
PSPNet | 74.23 | 90.82 | 83.64 | 91.17 | 94.29 | 84.72 | 87.05 | 93.12 | 51.51
FCN | 77.58 | 92.07 | 85.95 | 92.78 | 96.26 | 86.35 | 87.96 | 96.14 | 56.23
Proposed | 79.24 | 92.65 | 87.32 | 93.43 | 97.01 | 87.40 | 88.69 | 95.63 | 61.74
Bold values indicate the best performance in each column.
Table 3. Quantitative performance comparison with state-of-the-art methods on the Vaihingen dataset.
Method | mIoU | OA | Ave. F1 | Imp. Surf. (F1) | Buildings (F1) | Low Veg. (F1) | Trees (F1) | Cars (F1) | Clutter (F1)
DeepLabV3+ | 70.30 | 90.51 | 80.43 | 92.29 | 94.93 | 83.72 | 88.79 | 80.05 | 42.79
PANet | 71.82 | 90.40 | 81.96 | 91.99 | 95.18 | 83.43 | 88.72 | 82.77 | 49.69
Linknet | 71.57 | 90.61 | 81.41 | 92.16 | 95.34 | 83.41 | 89.13 | 83.92 | 44.52
MAnet | 69.97 | 90.88 | 79.42 | 92.19 | 95.37 | 84.11 | 89.33 | 82.28 | 33.25
PSPNet | 68.35 | 89.15 | 78.95 | 90.33 | 93.10 | 82.84 | 88.41 | 79.70 | 39.33
FCN | 72.23 | 90.47 | 82.10 | 92.17 | 94.79 | 83.11 | 89.32 | 85.54 | 47.64
Proposed | 73.66 | 90.72 | 83.56 | 92.35 | 95.58 | 83.63 | 88.98 | 85.08 | 55.71
Bold values indicate the best performance in each column.
Table 4. Ablation study results on the NCSBFF module and SA block.
Method | NCSBFF | SA | Potsdam mIoU | Potsdam OA | Potsdam Ave. F1 | Vaihingen mIoU | Vaihingen OA | Vaihingen Ave. F1
Baseline | ✘ | ✘ | 78.22 | 92.65 | 86.46 | 71.22 | 90.61 | 81.31
NCSBFF without SA | ✔ | ✘ | 78.68 | 92.73 | 86.84 | 72.40 | 90.65 | 82.50
Proposed | ✔ | ✔ | 79.24 | 92.65 | 87.32 | 73.66 | 90.72 | 83.56
Bold values indicate the best performance in each column. ✔: Enabled; ✘: Disabled.
Table 5. Generalization analysis of the proposed approach for multi-scale image inputs.
Input Size | mIoU | OA | F1
512 × 512 | 79.24 | 92.65 | 87.32
384 × 384 | 75.75 | 91.81 | 84.56
Table 6. Comparative analysis of model complexity (input size: 224 × 224).
Method | Params (MB) | FLOPs (G) | Potsdam (mIoU) | Vaihingen (mIoU) | Inference Time (ms)
DeepLabV3+ | 101.77 | 7.07 | 78.10 | 70.30 | 2.9
PAN | 92.55 | 6.69 | 76.94 | 71.82 | 3.4
Linknet | 118.93 | 8.27 | 77.28 | 71.57 | 3.1
MAnet | 562.44 | 14.32 | 76.15 | 69.97 | 5.6
PSPNet | 8.62 | 2.29 | 74.23 | 68.35 | 1.5
FCN | 125.69 | 26.58 | 77.58 | 72.23 | 4.0
NCSBFF + ResNet50 | 95.52 | 4.99 | – | – | –
Proposed | 77.06 | 3.96 | 79.24 | 73.66 | 5.2
Bold values indicate the best performance in each column. “–” denotes unavailable data.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

