AFNE-Net: Semantic Segmentation of Remote Sensing Images via Attention-Based Feature Fusion and Neighborhood Feature Enhancement

Li, Ke; Ji, Hao; Li, Zhijiang; Cui, Zeyu; Liu, Chengkai

doi:10.3390/rs17142443

Open AccessArticle

AFNE-Net: Semantic Segmentation of Remote Sensing Images via Attention-Based Feature Fusion and Neighborhood Feature Enhancement

by

Ke Li

¹

,

Hao Ji

^2,*,

Zhijiang Li

³

,

Zeyu Cui

² and

Chengkai Liu

²

¹

School of Urban Design, Wuhan University, Wuhan 430072, China

²

School of Earth and Space Science and Technology, Wuhan University, Wuhan 430072, China

³

School of Information Management, Wuhan University, Wuhan 430072, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(14), 2443; https://doi.org/10.3390/rs17142443

Submission received: 2 May 2025 / Revised: 27 June 2025 / Accepted: 11 July 2025 / Published: 14 July 2025

(This article belongs to the Section AI Remote Sensing)

Download

Browse Figures

Versions Notes

Abstract

Understanding remote sensing imagery is vital for object observation and planning. However, the acquisition of optical images is inevitably affected by shadows and occlusions, resulting in local discrepancies in object representation. To address these challenges, this paper proposes AFNE-Net, a general network architecture for remote sensing image segmentation. First, the model introduces an attention-based feature fusion module. Through the use of weighted fusion of multi-resolution features, this effectively expands the receptive field and enhances semantic associations between categories. Subsequently, a feature enhancement module based on the consistency of neighborhood semantic representation is introduced. This aims to improve the feature representation and reduce segmentation errors caused by local perturbations. Finally, evaluations are conducted on the ISPRS Potsdam, UAVid, and LoveDA datasets to verify the effectiveness of the proposed model.

Keywords:

semantic segmentation; deep learning; feature enhancement; multi-scale fusion; attention

1. Introduction

As remote sensing technology has developed rapidly, the methods of acquiring remote sensing images have become increasingly diverse [1]. Nowadays, remote sensing images are widely used in many fields, such as land use classification [2,3], urban planning [4,5,6], ecological monitoring [7], disaster assessment [8,9], and agricultural monitoring [10,11]. In this context, accurately identifying semantic categories has become a significant focus of current research.

Conventional visual images are typically characterized by high resolution and clear targets [12,13]. In recent years, with the rapid development of deep learning, various vision tasks based on such images have achieved remarkable results. As a result, researchers in the field of remote sensing have begun to adopt these methods to automate remote sensing tasks. However, unlike conventional images, remote sensing imagery exhibits a wider coverage and more diverse semantic categories within a fixed pixel matrix [14,15]. To address these challenges, Transformer-based architectures [16] have been introduced. This model leverage self-attention mechanisms to capture long-range dependencies. Additionally, some studies explore hybrid approaches by combining CNNs with Transformers. For example, DBSANet [8] effectively balances local feature extraction and global context analysis, achieving significant performance improvements. Similar results have been demonstrated with RingMoE [17].

Meanwhile, frequency domain methods have also been introduced into segmentation tasks. FAENet [18] decomposes features using the discrete wavelet transform (DWT) and constructs global relationships based on frequency components. This enhances the representation of features. In contrast, FDNet [19] separates features into components of high and low frequencies, which strengthens the representation of features by compressing redundant information.These approaches significantly alleviate the problem of ambiguous semantic boundaries. However, despite the promising results achieved in the intelligent understanding of remote sensing imagery, there are still several challenges.

Firstly, remote sensing imagery covers large spatial areas and contains objects of various scales. These characteristics make it difficult for a single-scale receptive field to effectively recognize targets of different categories. Although existing methods have introduced Transformer mechanisms to capture global and local relationships, the perceptual scope remains fixed. This hinders the accurate interpretation of targets, especially in regions with densely distributed semantic categories.To address this issue, this study presents an attention multi-scale feature fusion module. This module introduces an attention similarity matrix and a feature confidence matrix. The attention similarity matrix captures global semantic relationships, and the feature confidence matrix identifies stable and ambiguous regions. These two components collaborate to provide global-to-local feature guidance, which effectively corrects ambiguous representations and enhances the stability and discriminative power of the features.

Secondly, existing methods primarily focus on overall semantic relationships, overlooking local regions. Therefore, they still present challenges in addressing complex issues, such as shadow interference and local pixel variations. Due to the relatively low spatial resolution of remote sensing imagery, each pixel often corresponds to a broad area, which leads to ambiguity in representing surface features. In the presence of multiple texture variations within a target, local regions are susceptible to inconsistent pixel representations, which can lead to inaccurate segmentation.To address this issue, the study designs a neighborhood feature enhancement module. This module integrates contextual correlations and spatial semantics from multiple neighborhoods to enhance the perception of local structures, effectively mitigating segmentation errors caused by pixel inconsistency.

Overall, the main contributions of this study are as follows:

An attention multi-scale feature fusion (AF) module is proposed. This module captures global information from the image and constructs a confidence map based on the differences between multi-scale feature. This improves the maintenance of information and enables effective fusion of multi-scale information.
A novel neighborhood feature enhancement (NE) module is designed. This uses semantic representations from neighboring regions to reduce the network’s sensitivity to local pixel variations. This helps the network better capture local expressions of semantic classes.
The AFNE-Net model is introduced for the semantic segmentation of remote sensing images. This model achieves state-of-the-art performance on all publicly available benchmark datasets.

2. Related Works

2.1. Deep Learning for Image Segmentation

Image semantic segmentation using deep learning has made significant progress in recent years [20,21,22]. The initial research in neural networks was characterized by the development of fully convolutional networks (FCNs) [23], which overcame the limitations of traditional methods by employing an end-to-end prediction mechanism. However, due to their limited capacity to capture contextual information, the resulting segmentation is characterized by a coarse resolution. Subsequently, U-Net [24] was proposed, with a layer-by-layer expansion and fusion architecture, which effectively supplemented contextual information to a certain extent. This innovation has played a crucial role in semantic segmentation tasks. Furthermore, PSPNet [25] enhances the ability to capture contextual information through a pyramid scene parsing structure, effectively addressing the multi-scale characteristics of objects. SeNet [26] employs dilated convolutions to extend the receptive field, facilitating the efficient capture of multi-scale scene information and enhancing the image representation capability of the network. AMSNet [7] employs a multi-modal data fusion method that dynamically weights optical and LiDAR features using spatial scale-adaptive fusion. This approach addresses the lack of multi-scale information caused by differences in sensor resolution. LOGCAN++ [3] utilizes an adaptive local–global aware module to optimize performance in complex scenes.

2.2. Transformer Mechanism

The powerful global processing capabilities of attention mechanisms has led to their widespread interest in the field of image segmentation. The utilization of independent CNN imposes constraints on the capacity to discern long-range spatial dependencies. Vaswani et al. [16] proposed a sequence modeling method based on the Transformer architecture, which replaces traditional recurrent and convolutional structures with multi-head self-attention. This approach addresses challenges such as long-range dependency modeling and limited parallel computation. SCA-CNN [27] employs a novel dual attention mechanism that combines spatial and channel information to dynamically adjust multi-layer feature maps. This mechanism effectively utilizes contextual information and enhances feature representation. BETAM [28] introduces a triple attention mechanism, which integrates positional, spatial, and channel attention to improve the recognition of semantic class boundaries. Axial-UNet++ [29] introduces an axial-channel local normalization module, addressing the problem of long-range dependencies in images. TDBAN [30] employs a cross-branch collaborative approach and utilizes data correlation to integrate global and local features.

2.3. Multi-Scale Feature Fusion

High-resolution remote sensing data often face challenges such as target scale variability and environmental diversity [31]. To address these issues, researchers have proposed a series of innovative network architectures. These designs enhance the model’s ability to represent ground objects through multi-level feature interaction mechanisms.

AFENet [32] integrates local and global information to enhance feature extraction. This strategy has been demonstrated to enhance the representation of both global context and fine-grained details, ensuring the completeness of object boundaries. GLFFNet [33] learns adaptive parameters to provide adequate representation of each semantic class. FA-HRNet [34] employs a similar strategy, utilizing an adaptive attention mechanism to provide scale information pertinent to distinct semantic categories. FAENet [18] decomposes features into the frequency domain using the discrete wavelet transform (DWT), enabling multi-scale feature interaction in the frequency space and enhancing sensitivity to pixel variations. However, this approach invariably increases the sensitivity of the network to noise, which can compromise the stability of the model. DFF-Unet [35] extracts semantic and edge features to interpret image semantics from dual perspectives, aiming to meet the feature requirements of different regions. Nevertheless, the method still suffers from shadow interference and pixel ambiguity, which may lead to local mis-segmentation.

3. Methods

The study utilizes the U-Net architecture and is designed to efficiently perform semantic segmentation of remote sensing images. As illustrated in Figure 1, the proposed AFNE-Net is composed of three primary components: an encoder module, an feature fusion module, and a decoder module. The decoder incorporates three neighborhood feature enhancement modules, which focus on feature refinement and the enhancement of spatial semantic representation. The input is a remote sensing image

P \in R^{3 \times H \times W}

. ResNet-18(R18) [36] is used to extract feature vectors

{E_{i}}_{i = 1}^{4}

at different resolutions. The AF module fuses features from neighboring resolutions, enhancing interactions between layers and improving semantic relevance. The fused features are finally passed to the decoder, which strengthens the feature representations based on neighborhood expression consistency. This effectively reduces mis-segmentation caused by local pixel interference. Each component will be described in detail in the following subsections.

3.1. Encoder Module

The encoder is able to project low-variance RGB information into a high-dimensional space, enabling the discriminative representation of different semantic categories in the image. This study uses the U-Net backbone to build the encoder, which consists of four feature encoding modules. Each feature encoding module performs downsampling at a rate that is twice that of the preceding layer. This process is intended to expand the receptive field of the network and establish the foundation for capturing semantic correlations between categories.

R18 is a residual-based feature encoding network, showing high efficiency and accuracy in semantic segmentation tasks. This study adopts ResNet-18 as the feature encoder, aiming to extract deep feature vectors from multi-resolution images.

3.2. Attention Feature Fusion Module

Remote sensing images cover large areas and contain various object types with inconsistent sizes. To effectively represent target features at different scales, it is essential to enhance the feature representation capability of the network from a multi-scale perspective. This helps deal with challenges like feature blurring and semantic ambiguity caused by scale variations. The proposed module introduces diverse weighting matrices to increase the saliency of feature vectors. Additionally, a residual structure is incorporated to preserve fine-grained details. These improvements collectively enhance the precision of the model in segmentation tasks.

As shown in Figure 2, the multi-scale feature fusion module proposed in this paper consists of three AF modules, which are designed to replace traditional feature fusion and transmission mechanisms. Feature maps

{E_{i}}_{i = 1, \dots, 5}

from different levels are input to the corresponding AF module. In order to achieve multi-scale feature fusion, the low-resolution features are first upsampled to match the spatial dimensions of the high-resolution features. Subsequently, these elements are amalgamated as inputs to the fusion process. Finally, the fused feature maps are transmitted to the NE module.

As shown on the right side of Figure 2, a detailed view of the AF module is presented. The low-resolution map

E_{i + 1}

and the high-resolution feature map

E_{i}

capture different scale representations of the image. Multi-resolution features help build semantic connections and improve the accuracy of feature descriptions. Firstly, a convolution operation is applied to project the input features into a unified feature space, as shown in the following equation:

f_{Q} = {C o n v}_{1 \times 1} (U S (E_{i + 1}));

(1)

f_{K} = {C o n v}_{1 \times 1} (U S (E_{i + 1}));

(2)

f_{V} = {C o n v}_{1 \times 1} (E_{i})

(3)

where

U S (\cdot)

denotes the upsampling operation used to match the spatial dimensions.

C o n v_{1 \times 1} (\cdot)

represents a convolution operation with a

1 \times 1

kernel.

f_{Q}

and

f_{K}

are the feature maps after matching the channel dimensions.

Then, a feature similarity matrix is computed to enhance the global perception of the features. Meanwhile, the differences among multi-level features are used to estimate the confidence, reducing the weights of features with large representation differences. The following is a detailed description of the process:

W_{i} = S o f t m a x (f_{Q} \times f_{K}^{T});

(4)

W_{j} = S o f t m a x (f_{V} \cdot f_{K})

(5)

where

S o f t m a x (\cdot)

is the vector normalization function,

W_{i}

is the attention score matrix,

W_{j}

is the confidence matrix,

(\times)

denotes matrix multiplication, and

(\cdot)

denotes the dot product.

Subsequently, the low-resolution feature is multiplied by the attention score matrix to obtain the feature association vector. Meanwhile, the high-resolution feature is point-wise multiplied by the confidence matrix, weighting the features based on the similarity across multiple scales. This promotes consistent expression across layers and reduces feature diversity. Next, the two types of features are concatenated and fused through a convolution operation. A residual connection is introduced to combine the fused result with the high-resolution feature map, preserving fine-grained details and further enhancing multi-scale feature fusion. More information is provided below:

f_{i} = W_{i} \times f_{Q};

(6)

f_{j} = f_{V} \circ W_{j};

(7)

A F_{i} = C o n v_{1 \times 1} (C o n c a t (f_{i}, f_{j})) + E_{i}

(8)

where

C o n c a t (\cdot)

is the feature concatenation function, and

(\circ)

denotes the Hadamard product.

3.3. Feature Enhance Decoder

Due to the wide coverage of images, factors such as surface aging, lighting variations, and shadows can cause significant local differences in the same object at the different images. Convolution operations have been demonstrated to be effective in capturing the spatial relationships between neighbors. This capability contributes to enhancing the modeling of neighborhood consistency. In this study, a feature enhancement method is proposed. This method is based on the principle of neighborhood consistency, which helps mitigate the impact of local disturbances on results.

As shown in Figure 3, this module uses neighborhood consistency based on multi-scale features to smooth local differences and avoid interference from local noise. Additionally, residual operations maintain strong high-resolution representations, which ensure clear segmentation boundaries.

Firstly, the low-resolution decoding feature

D_{n - 1}

is concatenated with the adjacent multi-scale features

A F_{n}

and

A F_{n + 1}

. This is expressed as follows:

f_{c o n c a t} = C o n c a t (A F_{i}, A F_{i + 1}, D_{i + 1})

(9)

Then,

f_{c o n c a t}

is passed through different kernel sizes for feature smoothing. This considers neighborhood feature representations under different receptive fields and helps mitigate feature ambiguity caused by noise interference. This process can be expressed as follows:

f_{c 3} = C o n v_{3 \times 3} (f_{c o n c a t});

(10)

f_{c 5} = C o n v_{5 \times 5} (f_{c o n c a t})

(11)

where

C o n v_{3 \times 3}

denotes a convolution operation with a

3 \times 3

kernel, and

C o n v_{5 \times 5}

represents a convolution operation with a

5 \times 5

kernel.

Next, adjacent expressions are used as guidance. Multi-neighborhood features are then integrated to correct local errors. Finally, residual operations are applied to ensure accurate segmentation boundaries. The specific operations are as follows:

f_{n} = C o n v_{1 \times 1} (C o n c a t (f_{c o n c a t}, f_{c 3}, f_{c 5}));

(12)

D_{i} = A F_{i} + f_{n}

(13)

3.4. Loss Function

This study uses cross-entropy [37] and Dice loss [38] to measure the difference between the predictions of the model and the ground truth. In this paper, the weights

α

and

β

are both set to 1. This can be expressed as follows:

L_{L o s s} = α L_{C E} + β L_{D i c e}

(14)

The goal of cross-entropy loss is to minimize the negative log difference between the true labels and the predicted probabilities of the model. Cross-entropy loss for multi-class classification is defined as follows:

L_{C E} = - \sum_{i = 1}^{C} y_{i} log (p_{i})

(15)

where C represents the number of classes in the dataset.

p_{i}

is the predicted probability for the i-th class at a given pixel.

y_{i}

is the corresponding ground truth.

The purpose of Dice loss is to make the predicted semantic area closely match the ground truth (GT). This helps effectively address the class imbalance issue in the scene. Dice loss is defined as follows:

L_{D i c e} = \frac{1}{C} \sum_{i = 1}^{C} (1 - \frac{2 \sum p_{i} y_{i}}{\sum p_{i} + \sum y_{i}})

(16)

where

\sum p_{i}

represents the total area of the predicted semantic class, and

\sum y_{i}

is the corresponding ground truth.

4. Experiments

4.1. Experiment Settings

All experiments were implemented using PyTorch 2.0 and trained on a single NVIDIA RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). The Adam [39] optimizer was adopted with an initial learning rate of

1 \times 10^{- 3}

, and the weight decay parameter was set to 0.01. Convolution operations used default settings, with a kernel size of 1 and a stride of 1. The proposed method is evaluated on three datasets: ISPRS Potsdam, UAVid, and LoveDA. During training, all datasets underwent data augmentation techniques, including random rotation, vertical flipping, cropping, and scaling [40]. Each dataset was trained for 100 epochs with a batch size of 4. For performance evaluation, we employed the F1-score, intersection over union (IoU), and overall accuracy (OA) as evaluation metrics.

4.2. Dataset Description

4.2.1. ISPRS Potsdam Dataset

The ISPRS Potsdam dataset [41] consists of 38 aerial remote sensing images of the city of Potsdam, Germany. Each image has a resolution of 0.05 m and a pixel size of

6000 \times 6000

. The training set consists of 23 images, while the remaining images are used as the test set. The images cover a wide area with different land cover categories, which are classified into six types: pavement, trees, low vegetation, buildings, cars, and clutter. The dataset contains a digital surface model (DSM) and true orthophoto (TOP) images with three spectral channel combinations: IR-R-G, R-G-B, and R-G-B-IR. In this study, only RGB channel images are used, while other spectral channels and DSM data are excluded. All images are divided into patches of size

1024 \times 1024

for model training and testing.

4.2.2. UAVid Dataset

The UAVid dataset [42] contains images of urban environments captured by UAVs from an oblique perspective. The samples cover typical urban elements, and include six semantic categories: buildings, transportation networks, vegetation, vehicles, pedestrians, and urban facilities. Among them, 200 images are used for training, 70 for validation, and 120 for testing. The images are in

3840 \times 2160

and

4096 \times 2160

pixel formats. In this study, they are uniformly divided into

1024 \times 1024

pixel patches as input. Compared to top-view images, this dataset provides more detailed descriptions of small-scale samples in urban environments and more diverse structural representations of urban scenes, making it suitable for testing the generalization of networks.

4.2.3. LoveDA Dataset

The LoveDA dataset [43] represents the geomorphic characteristics of major Chinese cities such as Nanjing, Changzhou, and Wuhan. It consists of satellite remote sensing images with a resolution of 0.3 m, with a total of 5987 images. Among them, 2522 images are used for training, 1669 for validation, and 1796 for testing. Each image has a size of

1024 \times 1024

pixels. The dataset is mainly divided into two categories: urban and rural areas. Urban regions contain dense human-made structures with diverse colors, but the object sizes are generally small. In contrast, rural areas include natural environments with more uniform color tones and larger object scales. This distribution creates significant challenges for network stability. Thus, the dataset can effectively tests the model’s adaptability and robustness in different environments.

4.3. Results and Analysis

4.3.1. Experimental Results on the ISPRS Potsdam Dataset

Figure 4 shows the visualization results on the Potsdam dataset, illustrating the effectiveness of the proposed method in understanding urban scenes. Building rooftops often exhibit color inconsistencies due to material differences and lighting variations, which pose challenges for semantic segmentation. As shown in the second row of Figure 4a, the left image depicts texture inconsistencies on the rooftop. AerialFormer incorrectly classifies this area as clutter. In contrast, the proposed method accurately segments this region. It benefits from multi-scale feature representation and neighborhood constraint mechanisms, which effectively address intra-class texture interference. A similar phenomenon is observed in Figure 4b. The building rooftop on the left of the second row displays red and gray colors. Due to lighting differences, a dark region appears at the boundary between the two colors. AerialFormer misclassifies this region as low vegetation. In contrast, the proposed method correctly identifies this area, demonstrating the effectiveness and robustness of the strategy for addressing local texture variations.

Table 1 demonstrates that the proposed method achieves new state-of-the-art (SoTA) performance across multiple evaluation metrics on the ISPRS Potsdam dataset. Compared to the AerialFormer method [44] in 2024, the proposed method shows improvements of 0.3% in mF1, 0.4% in mIoU, and 0.8% in OA. In particular, the proposed method shows significant advantages in the imp. surf., building, and low veg. categories, with F1 score improvements of 0.3%, 0.2%, and 0.2%, respectively. Similarly, other categories achieve excellent results. Compared to the classic DeepLab v3 [45], the proposed method achieves improvements of 2.0% and 2.7% in the tree and clutter categories, respectively. These results fully validate the effectiveness of the proposed method in urban remote sensing scenarios, demonstrating that the network design is reasonable and has great application potential.

4.3.2. Experimental Results on the UAVid Dataset

To verify the effectiveness of our model for handling oblique photogrammetry tasks, we evaluated it on the UAVid dataset. Figure 5 illustrates the visual comparison between our method and AerialFormer. As shown in Figure 5a, several different semantic classes are distributed near the billboard. This phenomenon causes AerialFormer to misclassify them. A similar issue occurs in Figure 5c, showing that the vehicles are misclassified as buildings. Due to the high texture similarity and spatial proximity between vehicles and walls, it is difficult for models based on a single receptive field to accurately capture associations between different semantic categories. This hinders the ability to effectively address similar situations. In contrast, the proposed AF module enhances the interaction of scale-aware information, effectively alleviating issues. In Figure 5b, the tunnel area has shadows that result in local texture variations. With the incorporation of the NE module, shadow effects are significantly reduced. As a result, our method is closer to the ground truth. These results further validate the effectiveness and robustness of the method in handling complex urban remote sensing scenarios.

As shown in Table 2, the proposed method performs excellently on this dataset. Compared to the classic DeepLabv3+, the mIoU is improved by 3.2%. With respect to the current AerialFormer method, the mIoU is increased by 0.4%. Similarly, the performance on the building, road, and tree categories increases by 0.8%, 0.7%, and 0.4%, respectively. This result is consistent with the performance on the Potsdam dataset, demonstrating that the proposed method has strong recognition capabilities for large objects.

4.3.3. Experimental Results on the LoveDA Dataset

As shown in Figure 6a within the red box, shadows makes the textures of the barren and agricultural areas appear darker. This reduces the distinction in color between these highly similar categories, which poses a significant challenge for the segmentation task. In this case, AerialFormer has poor discriminative capability and misclassifies the barren as agricultural. In contrast, the proposed method effectively perceives multi-scale contextual information, addressing this issue. Similarly, in Figure 6d, where buildings are adjacent to shadows, the proposed model is able to accurately capture the adjacency relationship and achieve accurate segmentation, successfully avoiding mis-segmentation. In Figure 6b,c, the proposed method shows clearer boundary details and more complete contours. These results further validate the stable generalization ability of the method across different datasets, fully demonstrating its strong practicality and robustness. To further verify the generalization and robustness, this study compares the proposed method with several excellent methods based on the LoveDA dataset. As shown in Table 3, the classical method UNet++ [63] achieves 48.2% in mIoU, while the proposed method reaches 54.8%, representing an improvement of 6.6%. Compared to the recent AerialFormer, the proposed method shows an increase of 0.7%. In this dataset, the building category still achieves the highest score. In particular, the proposed method achieves a score of 25.0% in the barren category, which is at least 4% higher than other methods. This improvement benefits from the interaction of contextual information and attention to neighborhood relationships. Surprisingly, the proposed method also achieves the best performance in the background category. This may be due to the large number of shaded areas in satellite images. These shadows reduce the representation of pixel-level detail, which decreases the complexity and variability of the background class and increases its dependence on neighborhood information. The proposed method focuses on the representation of the neighborhood and achieves a higher score as a result.

4.4. Ablation Study

To verify the effectiveness of each proposed module, ablation experiments were conducted on the ISPRS Potsdam and LoveDA datasets. Specifically, ablations were performed on the AF module and the NE module. The settings of each ablation model are described below:

Baseline: U-Net [24] is used as the backbone network, and the encoder is replaced by ResNet-18. The AF module is replaced by the dense skip connections of U-Net++, and the NE module is replaced by a $1 \times 1$ convolution operation. This method serves as the baseline model for the ablation experiments, which aim to verify the effectiveness of each module proposed in this study.
Only-AF: The skip connections in the baseline model is replaced with the AF module. The objective is to verify that the introduction of neighborhood graph relationships can enhance the positive representation of features within the same semantic class, improving the ability of the network to recognize objects.
Only-NE: The decoder module in the baseline is replaced with the NE module. The objective is to demonstrate that the incorporation of neighborhood graph relationships can augment the positive representation of features within the same semantic class, enhancing the capacity of the network to recognize objects.
Ours: The objective of this experiment is to verify that the proposed AF and NE are functionally complementary, demonstrating their ability to enhance the overall model performance collaboratively, rather than being functionally redundant.

To effectively evaluate the performance of each method, mIoU is used as the evaluation metric. As shown in Table 4, the Only-AF model achieved a score of 79.4% on the Potsdam dataset, reflecting improvements of 2.1% and 1.2% over the baseline and Only-NE, respectively. On the LoveDA dataset, Only-AF’s score increased by 5.6% and 2.3% compared to the baseline and Only-NE, respectively. These results demonstrate the effectiveness of the proposed modules in addressing specific challenges and improving the accuracy of semantic segmentation tasks. Moreover, Only-AF achieves a faster inference speed, which is primarily the result of removing redundant skip connections in U-Net++. Experimental results show that Only-NE performs slightly worse than Only-AF. However, combining both modules improves performance by 0.7% and 1.1%, respectively. This shows that the two modules focus on different, non-overlapping problems and work together to improve segmentation accuracy.

5. Discussion

5.1. Discussion on Computational Efficiency

To demonstrate the effectiveness of the proposed method, we conducted comparative experiments on the Potsdam and LoveDA datasets. Table 5 show that our method works well across different evaluation metrics. It achieves mIoU scores of 80.1% and 54.8% on the two datasets, respectively. These results outperform those of the classical CNN-based UNet and Transformer-based TransUnet. Compared to AerialFormer-B, our method achieves a 0.7% improvement on the LoveDA dataset. Furthermore, it offers faster inference speed. These findings strongly support the superiority of the proposed method for semantic segmentation tasks.

5.2. Discussion on Loss

In this study, a combined loss function is used. To verify the effectiveness of this combined loss, a comparative experiment was conducted using different loss function configurations:

Only-CE: Uses only cross-entropy loss (CE loss), which constrains the classification results of each individual pixel.
Only-Dice: Uses only Dice loss, which focuses on the overall region matching.
Ours: Uses the combined loss function, which integrates the cross-entropy loss and Dice loss, to constrain network training at both the pixel and region level.

Table 6 presents the experimental results on the Potsdam dataset. Only-CE achieves high accuracy in overall segmentation. It performs especially well on categories with many samples and consistent color patterns in the images, such as low veg. (79.7%) and tree (81.4%). However, its performance on clutter is 2.0% lower than that of Only-Dice due to fewer samples and more diverse color features. This suggests that CE loss has certain limitations when dealing with small sample sizes and complex feature classes. In contrast, Dice loss shows advantages in identifying these difficult targets and improves recognition of the clutter class due to its direct constraint on region overlap, although its overall mIoU performance is relatively lower, while the combination of CE loss and Dice loss does not achieve the highest scores for individual semantic categories, it yields the best overall segmentation performance. The combined method improves mIoU by 0.1% compared to Only-CE, and by 0.6% compared to Only-Dice. This suggests that a combined loss function improves the overall representational ability of the model. It is well suited for semantic recognition tasks in large-scale remote sensing imagery.

To further validate the above conclusions, we conducted additional experiments using the LoveDA dataset. As shown in Table 7, our method continues to perform well, achieving mIoU improvements of 0.5% and 1.1% over Only-CE and Only-Dice, respectively. It is worth noting that Only-Dice performs particularly well in the water, barren, and agriculture categories. These categories generally exhibit low texture distinctiveness in the LoveDA dataset, which often leads to weak feature representation. In such cases, the pixel-level supervision mechanism of CE loss is affected by noise and inconsistencies. In contrast, the region-level constraint of Dice loss better handles such cases and enhances segmentation performance. Conversely, pixel-level supervision is more effective for categories like forest, which exhibit distinct texture features. These findings align with the results on the Potsdam dataset and further confirm the complementary strengths of CE and Dice losses. Thus, a joint optimization strategy combining both is beneficial for improving overall semantic segmentation performance.

5.3. Discussion on Loss Hyperparameters

The weight settings of each component in the combined loss function have a significant impact on the accuracy of the network. Thus, we systematically investigate this hyperparameters in this study. Specifically, the parameter

α

controls the weight of the CE loss, while the parameter

β

represents the weight of the Dice loss. In the comparative experiments,

α + β

was consistently kept at 1.

As shown in Table 8 and Table 9, we analyze the impact of various weight settings on model performance. To ensure consistency and generalizability, the analysis is conducted on both the Potsdam and LoveDA datasets. On the Potsdam dataset, as the weight of

β

increases, the IoU scores of different semantic categories exhibit a consistent trend: the score for the clutter category gradually decreases, while the score for the low veg. category steadily increases. This trend is consistent with the analysis presented in Section 5.2. With

α = 0.7

and

β = 0.3

, the model achieves an mIoU of 80.6%, indicating an effective balance between CE loss and Dice loss. In contrast, on the LoveDA dataset, the optimal hyperparameter configuration is found at

α = 0.6

and

β = 0.4

. This difference may be attributed to the large proportion of water, barren, and agriculture categories in the dataset, which benefit more from supervision based on Dice loss. This result is also supported by the conclusions in Section 5.2. Although the configuration with

α = β = 1

does not produce optimal results for each dataset, it demonstrates stable performance across multiple datasets. Therefore, this setting is adopted for the experiments in this study.

6. Conclusions

This paper proposes AFNE-Net, a novel neural network model for understanding remote sensing images. To address the challenges posed by variations in object scale and insufficient attention to local regions, the AF module is designed. This module uses a multi-resolution feature fusion strategy to create multi-scale feature representations. It also incorporates an attention mechanism to fuse global and local features, enhancing local perception. Additionally, the NE module is introduced to address local texture inconsistency caused by lighting variations, shadow interference, and internal texture differences. This module employs the principle of neighborhood consistency. By leveraging spatial similarity among pixels, the module effectively mitigates mis-segmentation caused by local texture inconsistencies, enhancing the understanding and robustness of the model in large-scale remote sensing scenes.

A comprehensive evaluation of AFNE-Net was conducted using three datasets: Potsdam, UAVID and LoveDA. The experimental results demonstrate that the proposed method achieves high accuracy and stability in addressing multi-class semantic regions and local mis-segmentation issues. This highlights the potential and practical value of the method in remote sensing image segmentation tasks.

In future work, we will explore lightweight improvements and introduce new strategies to expand the model’s applicability to shadow-heavy problems. These improvements will enable the model to adapt more effectively to large-scale, multi-source, heterogeneous remote sensing scenarios, expanding its range of practical applications.

Author Contributions

Conceptualization, K.L., H.J. and Z.L.; methodology, K.L., H.J. and Z.L.; software, K.L., Z.C. and C.L.; validation, K.L. and Z.C.; writing—original draft preparation, K.L. and C.L.; writing—review and editing, H.J. and Z.L.; supervision, K.L., H.J. and Z.L.; project administration, H.J. and Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Key Research and Development Program of China (NO. 2021YFC2802502).

Data Availability Statement

In this study, we used three datasets: Potsdam, UAVid, and LoveDA. The Potsdam dataset is available at the following link: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx (accessed on 1 May 2025). The UAVid dataset can be accessed at: https://uavid.nl/ (accessed on 1 May 2025). The LoveDA dataset is publicly available at: https://zenodo.org/records/5706578 (accessed on 1 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Bastani, F.; Wolters, P.; Gupta, R.; Ferdinando, J.; Kembhavi, A. Satlaspretrain: A large-scale dataset for remote sensing image understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 16772–16782. [Google Scholar]
Hansen, M.C.; Defries, R.S.; Townshend, J.R.G.; Sohlberg, R. Global land cover classification at 1 km spatial resolution using a classification tree approach. Int. J. Remote Sens. 2000, 21, 1331–1364. [Google Scholar] [CrossRef]
Ma, X.; Lian, R.; Wu, Z.; Guo, H.; Yang, F.; Ma, M.; Wu, S.; Du, Z.; Zhang, W.; Song, S. LOGCAN++: Adaptive Local-Global Class-Aware Network for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4404216. [Google Scholar] [CrossRef]
Li, Y.; Liu, S.; Wu, J.; Sun, W.; Wen, Q.; Wu, Y.; Qin, X.; Qiao, Y. Multi-Scale Kolmogorov-Arnold Network (KAN)-Based Linear Attention Network: Multi-Scale Feature Fusion with KAN and Deformable Convolution for Urban Scene Image Semantic Segmentation. Remote Sens. 2025, 17, 802. [Google Scholar] [CrossRef]
Li, R.; Zheng, S.; Duan, C.; Wang, L.; Zhang, C. Land cover classification from remote sensing images based on multi-scale fully convolutional network. Geo-Spatial Inf. Sci. 2022, 25, 278–294. [Google Scholar] [CrossRef]
Marcos, D.; Volpi, M.; Kellenberger, B.; Tuia, D. Land cover mapping at very high resolution with rotation equivariant CNNs: Towards small yet accurate models. ISPRS J. Photogramm. Remote Sens. 2018, 145, 96–107. [Google Scholar] [CrossRef]
Ye, Z.; Li, Y.; Li, Z.; Liu, H.; Zhang, Y.; Li, W. Attention Multiscale Network for Semantic Segmentation of Multimodal Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5610315. [Google Scholar] [CrossRef]
Li, Y.; Zhu, W.; Wu, J.; Zhang, R.; Xu, X.; Zhou, Y. DBSANet: A Dual-Branch Semantic Aggregation Network Integrating CNNs and Transformers for Landslide Detection in Remote Sensing Images. Remote Sens. 2025, 17, 807. [Google Scholar] [CrossRef]
Zhao, L.; Ye, L.; Zhang, M.; Jiang, H.; Yang, Z.; Yang, M. DPSDA-Net: Dual-Path Convolutional Neural Network with Strip Dilated Attention Module for Road Extraction from High-Resolution Remote Sensing Images. Remote Sens. 2023, 15, 3741. [Google Scholar] [CrossRef]
Chakraborty, D.; Deka, B. UAV sensing-based semantic image segmentation of litchi tree crown using deep learning. In Proceedings of the 2023 IEEE Applied Sensing Conference (APSCON), Bengaluru, India, 23–25 January 2023; pp. 1–3. [Google Scholar]
Li, Y.; Liu, W.; Ge, Y.; Yuan, S.; Zhang, T.; Liu, X. Extracting Citrus-Growing Regions by Multiscale UNet Using Sentinel-2 Satellite Imagery. Remote Sens. 2024, 16, 36. [Google Scholar] [CrossRef]
Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364. [Google Scholar] [CrossRef]
Ma, J.; Liang, P.; Yu, W.; Chen, C.; Guo, X.; Wu, J.; Jiang, J. Infrared and visible image fusion via detail preserving adversarial learning. Inf. Fusion 2020, 54, 85–98. [Google Scholar] [CrossRef]
Yin, P.; Zhang, D.; Han, W.; Li, J.; Cheng, J. High-Resolution Remote Sensing Image Semantic Segmentation via Multiscale Context and Linear Self-Attention. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 9174–9185. [Google Scholar] [CrossRef]
Qi, X.; Zhu, P.; Wang, Y.; Zhang, L.; Peng, J.; Wu, M.; Chen, J.; Zhao, X.; Zang, N.; Mathiopoulos, P.T. MLRSNet: A multi-label high spatial resolution remote sensing dataset for semantic scene understanding. ISPRS J. Photogramm. Remote Sens. 2020, 169, 337–350. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
Bi, H.; Feng, Y.; Tong, B.; Wang, M.; Yu, H.; Mao, Y.; Chang, H.; Diao, W.; Wang, P.; Yu, Y.; et al. RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation. arXiv 2025, arXiv:2504.03166. [Google Scholar]
Zhong, J.; Zeng, T.; Xu, Z.; Wu, C.; Qian, S.; Xu, N.; Chen, Z.; Lyu, X.; Li, X. A Frequency Attention-Enhanced Network for Semantic Segmentation of High-Resolution Remote Sensing Images. Remote Sens. 2025, 17, 402. [Google Scholar] [CrossRef]
Li, X.; Xu, F.; Yu, A.; Lyu, X.; Gao, H.; Zhou, J. A Frequency Decoupling Network for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5607921. [Google Scholar] [CrossRef]
Wan, S.; Gong, C.; Zhong, P.; Du, B.; Zhang, L.; Yang, J. Multiscale dynamic graph convolutional network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 58, 3162–3177. [Google Scholar] [CrossRef]
Wang, S.; Wang, Y.; Yue, J.; Liang, H.; Zhang, Z.; Li, B. UniU-Net: A Unified U-Net Deep Learning Approach for High-Precision Areca Palm Segmentation in Remote Sensing Imagery. Appl. Sci. 2025, 15, 4813. [Google Scholar] [CrossRef]
Tian, X.; Bai, L.; Mo, D. A Garbage Detection and Classification Model for Orchards Based on Lightweight YOLOv7. Sustainability 2025, 17, 3922. [Google Scholar] [CrossRef]
Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5659–5667. [Google Scholar]
Wang, G.; Chen, J.; Mo, L.; Wu, P.; Yi, X. Border-Enhanced Triple Attention Mechanism for High-Resolution Remote Sensing Images and Application to Land Cover Classification. Remote Sens. 2024, 16, 2814. [Google Scholar] [CrossRef]
Hu, D.; Zheng, Z.; Liu, Y.; Liu, C.; Zhang, X. Axial-UNet++ Power Line Detection Network Based on Gated Axial Attention Mechanism. Remote Sens. 2024, 16, 4585. [Google Scholar] [CrossRef]
Du, B.; Shan, L.; Shao, X.; Zhang, D.; Wang, X.; Wu, J. Transform Dual-Branch Attention Net: Efficient Semantic Segmentation of Ultra-High-Resolution Remote Sensing Images. Remote Sens. 2025, 17, 540. [Google Scholar] [CrossRef]
Zhang, L.; Zhang, L. Artificial intelligence for remote sensing data analysis: A review of challenges and opportunities. IEEE Geosci. Remote Sens. Mag. 2022, 10, 270–294. [Google Scholar] [CrossRef]
Li, J.; Cheng, S. AFENet: An Attention-Focused Feature Enhancement Network for the Efficient Semantic Segmentation of Remote Sensing Images. Remote Sens. 2024, 16, 4392. [Google Scholar] [CrossRef]
Zhu, S.; Zhao, L.; Xiao, Q.; Ding, J.; Li, X. GLFFNet: Global–Local Feature Fusion Network for High-Resolution Remote Sensing Image Semantic Segmentation. Remote Sens. 2025, 17, 1019. [Google Scholar] [CrossRef]
He, B.; Wu, D.; Wang, L.; Xu, S. FA-HRNet: A New Fusion Attention Approach for Vegetation Semantic Segmentation and Analysis. Remote Sens. 2025, 16, 4194. [Google Scholar] [CrossRef]
Wang, X.; Fan, Z.; Jiang, Z.; Yan, Y.; Yang, H. EDFF-Unet: An Improved Unet-Based Method for Cloud and Cloud Shadow Segmentation in Remote Sensing Images. Remote Sens. 2025, 17, 1432. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Mao, A.; Mohri, M.; Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. In Proceedings of the International Conference on Machine Learning, PMLR, Edmonton, AB, Canada, 30 June–3 July 2023; pp. 23803–23828. [Google Scholar]
Li, X.; Sun, X.; Meng, Y.; Liang, J.; Wu, F.; Li, J. Dice Loss for Data-imbalanced NLP Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 465–476. [Google Scholar]
Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Pradeep Reddy, G.; Rohan, D.; Kareem, S.M.A.; Venkata Pavan Kumar, Y.; Purna Prakash, K.; Janapati, M. A Custom Convolutional Neural Network Model-Based Bioimaging Technique for Enhanced Accuracy of Alzheimer’s Disease Detection. Eng. Proc. 2025, 87, 47. [Google Scholar]
Interational Society for Photogrammetry and Remote Sensing. 2D Semantic Labeling Contest-Potsdam. 2018. Available online: https://www.isprs.org/education/benchmarks/UrbanSemLab/ (accessed on 9 January 2025).
Lyu, Y.; Vosselman, G.; Xia, G.S.; Yilmaz, A.; Yang, M.Y. UAVid: A semantic segmentation dataset for UAV imagery. ISPRS J. Photogramm. Remote Sens. 2020, 165, 108–119. [Google Scholar] [CrossRef]
Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Online, 6–14 December 2021; Vanschoren, J., Yeung, S., Eds.; Zenodo: Meyrin, Switzerland, 2021; Volume 1. Available online: https://github.com/Junjue-Wang/LoveDA (accessed on 10 July 2025).
Hanyu, T.; Yamazaki, K.; Tran, M.; McCann, R.A.; Liao, H.; Rainwater, C.; Adkins, M.; Cothren, J.; Le, N. AerialFormer: Multi-Resolution Transformer for Aerial Image Segmentation. Remote Sens. 2024, 16, 2930. [Google Scholar] [CrossRef]
Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified Perceptual Parsing for Scene Understanding. In Proceedings of the Computer Vision—ECCV 2018, 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part V. Springer: Cham, Switzerland, 2018; pp. 432–448. [Google Scholar]
Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the Computer Vision—ECCV 2018, 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part V. Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 833–851. [Google Scholar]
Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. DenseASPP for Semantic Segmentation in Street Scenes. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3684–3692. [Google Scholar]
Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3141–3149. [Google Scholar]
Li, X.; Zhong, Z.; Wu, J.; Yang, Y.; Lin, Z.; Liu, H. Expectation-Maximization Attention Networks for Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9166–9175. [Google Scholar]
Huang, Z.; Wang, X.; Wei, Y.; Huang, L.; Shi, H.; Liu, W.; Huang, T.S. CCNet: Criss-Cross Attention for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6896–6908. [Google Scholar] [CrossRef]
Li, H.; Qiu, K.; Chen, L.; Mei, X.; Hong, L.; Tao, C. SCAttNet: Semantic Segmentation Network With Spatial and Channel Attention Mechanism for High-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 18, 905–909. [Google Scholar] [CrossRef]
Huang, Z.; Yu, Y.; Xu, J.; Ni, F.; Le, X. PF-Net: Point Fractal Network for 3D Point Cloud Completion. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7659–7667. [Google Scholar]
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems 34, Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, Online, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: New York, NY, USA, 2021; Volume 34, pp. 12077–12090. [Google Scholar]
Sun, L.; Zou, H.; Wei, J.; Cao, X.; He, S.; Li, M.; Liu, S. Semantic Segmentation of High-Resolution Remote Sensing Images Based on Sparse Self-Attention and Feature Alignment. Remote Sens. 2023, 15, 1598. [Google Scholar] [CrossRef]
Wang, T.; Xu, C.; Liu, B.; Yang, G.; Zhang, E.; Niu, D.; Zhang, H. MCAT-UNet: Convolutional and Cross-Shaped Window Attention Enhanced UNet for Efficient High-Resolution Remote Sensing Image Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 9745–9758. [Google Scholar] [CrossRef]
Huang, G.; Chen, D.; Li, T.; Wu, F.; van der Maaten, L.; Weinberger, K. Multi-Scale Dense Networks for Resource Efficient Image Classification. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
Liu, Y.; Chen, H.; Shen, C.; He, T.; Jin, L.; Wang, L. ABCNet: Real-Time Scene Text Spotting With Adaptive Bezier-Curve Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images. Remote Sens. 2021, 13, 3065. [Google Scholar] [CrossRef]
Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
Li, Y.; Li, X.; Dai, Y.; Hou, Q.; Liu, L.; Liu, Y.; Cheng, M.M.; Yang, J. LSKNet: A Foundation Lightweight Backbone for Remote Sensing. Int. J. Comput. Vis. 2025, 133, 1410–1431. [Google Scholar] [CrossRef]
Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Proceedings 4. Springer: Cham, Switzerland, 2018; pp. 3–11. [Google Scholar]
Chaurasia, A.; Culurciello, E. LinkNet: Exploiting encoder representations for efficient semantic segmentation. In Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; pp. 1–4. [Google Scholar]
Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A. Foreground-Aware Relation Network for Geospatial Object Segmentation in High Spatial Resolution Remote Sensing Imagery. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4095–4104. [Google Scholar]
Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Online, 11–17 October 2021; pp. 7242–7252. [Google Scholar]
Ma, A.; Wang, J.; Zhong, Y.; Zheng, Z. FactSeg: Foreground Activation-Driven Small Object Semantic Segmentation in Large-Scale Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5606216. [Google Scholar] [CrossRef]
Xu, R.; Wang, C.; Zhang, J.; Xu, S.; Meng, W.; Zhang, X. RSSFormer: Foreground Saliency Enhancement for Remote Sensing Land-Cover Segmentation. IEEE Trans. Image Process. 2023, 32, 1052–1064. [Google Scholar] [CrossRef]
Chen, Y.; Fang, P.; Zhong, X.; Yu, J.; Zhang, X.; Li, T. Hi-ResNet: Edge Detail Enhancement for High-Resolution Remote Sensing Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 15024–15040. [Google Scholar] [CrossRef]
Wang, L.; Dong, S.; Chen, Y.; Meng, X.; Fang, S.; Fei, S. MetaSegNet: Metadata-Collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5644211. [Google Scholar] [CrossRef]
Yu, X.; Dai, P.; Li, W.; Ma, L.; Shen, J.; Li, J.; Qi, X. Towards Efficient and Scale-Robust Ultra-High-Definition Image Demoiréing. In Proceedings of the Computer Vision—ECCV 2022, 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XVIII. Springer: Cham, Switzerland, 2022; pp. 646–662. [Google Scholar]
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In Proceedings of the Computer Vision—ECCV 2022, 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XVIII. Springer: Cham, Switzerland, 2023; pp. 205–218. [Google Scholar]

Figure 1. Overall architecture of our network, AFNE-Net. The network consists of four layers with feature dimensions of (32, H/4, W/4), (48, H/8, W/8), (136, H/16, W/16), and (384, H/32, W/32), respectively.

Figure 2. The proposed dilated attention-based multi-scale feature fusion (AF) module.

Figure 3. The proposed dilated neighborhood feature enhancement (NE) module.

Figure 4. Visual comparison of AerialFormer and our method on ISPRS Potsdam. (a,b) show the selected representative samples.

Figure 5. Visual comparison of AerialFormer and our method on UAVid. (a–c) show the selected representative samples.

Figure 6. Visual comparison of AerialFormer and our method on LoveDA. (a–d) show the selected representative samples.

Table 1. Quantitative comparison results on the ISPRS Potsdam dataset (%). The best results are presented in bold.

Method	Year ↑	OA	mIoU	mF1	F1
Method	Year ↑	OA	mIoU	mF1	Imp. Surf.	Building	Low Veg.	Tree	Car	Clutter
FCN [23]	2015	—	64.2	75.9	87.6	91.6	77.8	84.6	73.5	40.3
PSPNet [25]	2017	90.1	77.1	85.6	92.6	96.2	82.6	87.5	93.5	55.4
DeepLabv3 [46]	2017	90.0	77.2	85.6	92.4	96.0	82.0	87.1	94.3	56.7
UPerNet [47]	2018	91.0	77.8	85.6	92.6	96.1	83.0	87.8	94.9	58.0
DeepLabv3+ [48]	2018	90.7	77.1	85.6	92.2	96.4	82.6	87.8	94.8	55.1
DenseASPP [49]	2018	89.6	73.9	83.0	90.6	94.4	80.2	85.0	91.3	43.2
DANet [50]	2019	—	65.6	77.1	88.5	92.7	78.5	85.7	72.3	48.9
EMANet [51]	2019	—	65.6	77.7	88.2	92.5	78.0	85.7	72.7	48.9
CCNet [52]	2019	—	66.3	77.9	88.3	92.6	78.3	85.5	73.2	49.1
SCAttNet [53]	2020	88.0	68.3	78.4	81.8	93.2	78.8	86.5	75.6	30.2
PFNet [54]	2021	90.5	75.4	84.8	92.5	96.4	82.4	88.5	93.4	41.5
Segformer [55]	2021	91.3	77.5	86.7	93.5	96.7	86.6	88.9	95.3	50.1
LOGCAN++ [3]	2023	—	78.6	86.6	87.5	95.8	87.3	88.3	96.4	51.2
SAANet [56]	2023	88.2	78.3	86.2	92.6	96.4	85.8	87.2	93.2	43.1
MCAT-UNet [57]	2024	88.3	75.4	84.8	92.5	94.3	82.5	74.3	86.8	41.5
AerialFormer-B [44]	2024	91.4	79.7	87.6	93.5	97.2	88.1	89.3	95.7	61.9
Ours	-	92.2	80.1	87.9	93.8	97.4	88.3	89.1	97.9	59.4

Table 2. Quantitative comparison results on the UAVid dataset (%). The best results are presented in bold.

Method	Year ↑	mIoU	Building	Road	Tree	Vegetation	Moving Car	Static Car	Human	Clutter
MSDNet [58]	2018	57.0	57.0	79.8	74.0	74.4	55.9	62.9	32.1	19.7
DeepLabv3+ [48]	2018	67.4	66.7	87.6	80.0	79.5	62.0	71.7	68.6	22.8
DANet [50]	2019	64.9	58.9	77.9	68.3	61.5	59.6	47.4	9.1	60.6
ABCNet [59]	2020	67.4	86.4	81.2	79.9	63.1	69.8	48.4	13.9	63.8
BANet [60]	2021	66.7	85.4	80.7	78.9	62.1	69.3	52.8	21.0	64.6
SegFormer [55]	2021	66.6	86.3	80.1	79.6	62.3	72.5	52.5	28.5	66.0
UNetFormer [61]	2022	68.4	87.4	81.5	80.2	63.5	73.6	56.4	31.0	67.8
LSKNet-S [62]	2024	70.0	84.8	82.9	80.9	65.5	76.8	64.9	31.8	69.6
AerialFormer [44]	2024	70.2	89.0	82.7	80.8	64.6	77.5	67.5	30.7	70.4
Ours	—	70.6	90.2	83.4	81.2	64.8	76.9	67.2	29.8	70.5

Table 3. Quantitative comparison results on the LoveDA dataset (%). The best results are presented in bold.

Method	Year ↑	mIoU	Background	Building	Road	Water	Barren	Forest	Agriculture
FCN [23]	2015	46.7	42.6	49.5	48.1	73.1	11.8	43.5	58.3
UNet [24]	2015	47.8	43.1	52.7	52.8	73.1	10.3	43.1	59.9
LinkNet [64]	2017	48.5	43.6	52.1	52.5	76.9	12.2	45.1	57.3
SegNet [65]	2017	47.3	41.8	51.8	51.8	74.5	10.9	42.9	56.7
UNet++ [63]	2018	48.2	42.9	52.6	52.8	74.5	11.4	44.4	58.8
DeeplabV3+ [48]	2018	47.6	43.0	50.9	52.0	74.4	10.4	44.2	58.5
FarSeg [66]	2020	48.2	43.4	51.8	53.3	76.1	10.8	43.2	58.6
Segmenter [67]	2021	47.1	38.0	50.7	48.7	77.4	13.3	43.5	58.2
Segformer [55]	2021	49.1	42.2	56.4	50.7	78.5	17.2	45.2	53.8
FactSeg [68]	2022	48.9	42.6	53.6	52.8	76.9	16.2	42.9	57.5
UNetFormer [61]	2023	52.4	44.7	58.8	54.9	79.6	20.1	46.0	62.5
RSSFormer [69]	2023	52.4	52.4	60.7	55.2	76.3	18.7	45.4	58.3
LOGCAN++ [3]	2023	53.4	47.4	58.4	56.5	80.1	18.4	47.9	64.8
Hi-ResNet [70]	2023	52.5	46.7	58.3	55.9	80.1	17.0	46.7	62.7
MetaSegNet [71]	2023	52.2	44.0	57.9	58.1	79.9	18.2	47.7	59.4
ESDNet [72]	2024	50.1	41.6	53.8	54.8	78.7	19.5	44.2	58.0
AerialFormer-B [44]	2024	54.1	47.8	60.7	59.3	81.5	17.9	47.9	64.0
Ours	—	54.8	52.4	64.1	57.1	81.6	25.0	44.5	60.1

Table 4. Ablation study of AF and NE on the ISPRS Potsdam and LoveDA datasets.

Method	AF	NE	Params (M)	Inference Time (s)	mIoU
Method	AF	NE	Params (M)	Inference Time (s)	Potsdam	LoveDA
Baseline	–	–	50.6	0.036	77.3	49.5
Only-AF	✓	–	98.3	0.032	79.4	53.7
Only-NE	–	✓	71.4	0.046	78.2	51.4
Ours	✓	✓	106.3	0.044	80.1	54.8

Table 5. A comparative analysis of model complexity and performance for different semantic segmentation methods on the Potsdam dataset is conducted. The table lists the number of model parameters (Params), computational cost (GFLOPs), inference time, and mIoU scores.

Method	Year ↑	Params (M)	FLOPs (GB)	Inference Time (s)	mIoU
Method	Year ↑	Params (M)	FLOPs (GB)	Inference Time (s)	Potsdam	LoveDA
Unet [24]	2015	31.0	184.6	0.020	65.5	47.8
TransUnet [73]	2021	90.7	233.7	0.023	70.3	49.0
SwinUNet [74]	2022	41.4	237.4	0.021	78.6	50.2
AerialFormer-B [44]	2024	113.8	126.8	0.047	79.7	54.1
Ours	-	106.3	122.8	0.044	80.1	54.8

Table 6. Comparison of Different Loss Functions on the ISPRS Potsdam Dataset.

Method	CE	Dice	Imp. Surf.	Building	Low Veg.	Tree	Car	Clutter	mIoU
Only-CE	✓	×	87.3	95.1	79.7	81.4	95.1	41.0	80.0
Only-Dice	×	✓	86.7	94.5	77.7	80.2	94.6	43.0	79.5
Ours	✓	✓	88.3	94.9	79.0	80.3	95.9	42.2	80.1

Table 7. Comparison of Different Loss Functions on the LoveDA Dataset.

Method	CE	Dice	Background	Building	Road	Water	Barren	Forest	Agriculture	mIoU
Only-CE	✓	×	51.2	63.3	59.2	79.3	20.8	47.4	59.5	54.3
Only-Dice	×	✓	49.2	61.2	54.6	82.2	26.3	42.6	60.5	53.7
Ours	✓	✓	52.4	64.1	57.1	81.6	25.0	44.5	60.1	54.8

Table 8. Comparison of Different Loss Weight Settings on the ISPRS Potsdam Dataset.

Method	$α$	$β$	Imp.Surf	Building	Low Veg.	Tree	Car	Clutter	mIoU
Method-1	0.6	0.4	88.0	94.5	78.6	80.9	96.6	42.7	80.2
Method-2	0.7	0.3	88.2	94.6	79.9	81.3	96.2	42.5	80.6
Method-3	0.8	0.2	86.9	95.1	79.5	81.1	93.9	41.8	79.8
Ours	1.0	1.0	88.3	94.9	79.0	80.3	95.9	42.2	80.1

Table 9. Comparison of different loss weight settings on the LoveDA Dataset.

Method	$α$	$β$	Background	Building	Road	Water	Barren	Forest	Agriculture	mIoU
Method-1	0.6	0.4	52.6	63.9	58.2	82.1	23.1	47.7	59.5	55.2
Method-2	0.7	0.3	50.3	60.8	58.9	81.6	22.3	47.2	60.4	54.6
Method-3	0.8	0.2	51.8	63.2	58.3	81.5	21.9	46.3	59.2	54.5
Ours	1.0	1.0	52.4	64.1	57.1	81.6	25.0	44.5	60.1	54.8

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, K.; Ji, H.; Li, Z.; Cui, Z.; Liu, C. AFNE-Net: Semantic Segmentation of Remote Sensing Images via Attention-Based Feature Fusion and Neighborhood Feature Enhancement. Remote Sens. 2025, 17, 2443. https://doi.org/10.3390/rs17142443

AMA Style

Li K, Ji H, Li Z, Cui Z, Liu C. AFNE-Net: Semantic Segmentation of Remote Sensing Images via Attention-Based Feature Fusion and Neighborhood Feature Enhancement. Remote Sensing. 2025; 17(14):2443. https://doi.org/10.3390/rs17142443

Chicago/Turabian Style

Li, Ke, Hao Ji, Zhijiang Li, Zeyu Cui, and Chengkai Liu. 2025. "AFNE-Net: Semantic Segmentation of Remote Sensing Images via Attention-Based Feature Fusion and Neighborhood Feature Enhancement" Remote Sensing 17, no. 14: 2443. https://doi.org/10.3390/rs17142443

APA Style

Li, K., Ji, H., Li, Z., Cui, Z., & Liu, C. (2025). AFNE-Net: Semantic Segmentation of Remote Sensing Images via Attention-Based Feature Fusion and Neighborhood Feature Enhancement. Remote Sensing, 17(14), 2443. https://doi.org/10.3390/rs17142443

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

AFNE-Net: Semantic Segmentation of Remote Sensing Images via Attention-Based Feature Fusion and Neighborhood Feature Enhancement

Abstract

1. Introduction

2. Related Works

2.1. Deep Learning for Image Segmentation

2.2. Transformer Mechanism

2.3. Multi-Scale Feature Fusion

3. Methods

3.1. Encoder Module

3.2. Attention Feature Fusion Module

3.3. Feature Enhance Decoder

3.4. Loss Function

4. Experiments

4.1. Experiment Settings

4.2. Dataset Description

4.2.1. ISPRS Potsdam Dataset

4.2.2. UAVid Dataset

4.2.3. LoveDA Dataset

4.3. Results and Analysis

4.3.1. Experimental Results on the ISPRS Potsdam Dataset

4.3.2. Experimental Results on the UAVid Dataset

4.3.3. Experimental Results on the LoveDA Dataset

4.4. Ablation Study

5. Discussion

5.1. Discussion on Computational Efficiency

5.2. Discussion on Loss

5.3. Discussion on Loss Hyperparameters

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI