Article

EFMANet: An Edge-Fused Multidimensional Attention Network for Remote Sensing Semantic Segmentation

School of Computer Science and Technology, Xinjiang University, Ürümqi 830046, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(22), 3695; https://doi.org/10.3390/rs17223695
Submission received: 31 August 2025 / Revised: 30 October 2025 / Accepted: 10 November 2025 / Published: 12 November 2025

Highlights

What are the main findings?
  • EFMANet integrates edge information into the network level by level and extracts multi-dimensional spatial information to achieve semantic segmentation of complex remote sensing images.
  • EFM integrates edge information, while MCFA extracts multi-dimensional spatial information, jointly improving the model’s segmentation accuracy.
What are the implications of the main findings?
  • EFMANet provides an edge fusion strategy for remote sensing semantic segmentation, improving the accuracy of object edge segmentation and spatial information perception.
  • The proposed module can effectively alleviate the problem of edge gradient vanishing in the network and help understand spatial structure information.

Abstract

Accurate semantic segmentation of remote sensing images is crucial for geographical studies. However, mainstream segmentation methods, primarily based on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), often fail to effectively capture edge features, leading to incomplete image feature representation and missing edge information. Moreover, existing approaches generally overlook the modeling of relationships between channel and spatial dimensions, restricting effective interactions and consequently limiting the comprehensiveness and diversity of feature representation. To address these issues, we propose an Edge-Fused Multidimensional Attention Network (EFMANet). Specifically, we employ the Sobel edge detection operator to obtain rich edge information and introduce an Edge Fusion Module (EFM) to fuse the downsampled features of the original and edge-detected images, thereby enhancing the model’s ability to represent edge features and surrounding pixels. Additionally, we propose a Multi-Dimensional Collaborative Fusion Attention (MCFA) Module to effectively model spatial and channel relationships through multi-dimensional feature fusion and integrate global and local information via an attention mechanism. Extensive comparative and ablation experiments on the Vaihingen and Potsdam datasets from the International Society for Photogrammetry and Remote Sensing (ISPRS), as well as the Land Cover Domain Adaptation (LoveDA) dataset, demonstrate that our proposed EFMANet achieves superior performance compared to existing state-of-the-art methods.

1. Introduction

Semantic segmentation is an essential task in computer vision, and remote sensing segmentation has recently attracted growing attention owing to its many practical applications. Advances in aerospace platforms and sensors make it possible to collect vast amounts of high-quality imagery that reflect environmental conditions and human activity. Semantic segmentation applied to high-precision remote sensing images has become a crucial tool for urban planning [1], land use analysis [2], and road extraction [3], among other tasks. Remote sensing semantic segmentation involves pixel-level classification of remote sensing images, assigning a category label to each pixel. Traditional methods, such as Random Forests (RFs) [4] and Conditional Random Fields (CRFs) [5], have limited performance on complex remote sensing imagery. The growing accessibility of large-scale remote sensing data, coupled with advances in deep learning and parallel computing, has led to the widespread adoption of deep learning algorithms for remote sensing scene classification.
CNNs have been extensively utilized for their strong ability to extract features effectively. The introduction of FCN [6] marked the first instance of applying CNNs to pixel-level prediction tasks, enabling end-to-end training and dense output prediction. Later, SegNet [7] adopted an encoder-decoder architecture with pooling indices to refine segmentation outputs while maintaining computational efficiency. U-Net [8] further improved feature recovery by introducing skip connections that transmit high-resolution spatial information from the encoder to the decoder, which is particularly useful for delineating fine structures and object boundaries. DeepLab [9] incorporated atrous spatial pyramid pooling (ASPP) and leveraged depthwise separable convolutions to reduce computational cost and parameter count, improving performance in complex scenes. More recent advances focus on multi-scale feature integration and global context modeling. MGNet [10] proposed a multiscale feature aggregation framework that incorporates global information to enhance semantic segmentation accuracy on remote sensing imagery. ASLMSHNet [11] developed a multi-scale hybrid network that effectively balances segmentation precision and computational efficiency, demonstrating robust performance on high-resolution urban and natural scenes.
In recent years, the rise of Transformers has underscored the importance of modeling long-range dependencies in vision tasks. Vision Transformers [12] recast an image as a sequence of tokens and apply global self-attention to aggregate semantic cues from distant regions, a strength that helps resolve ambiguities arising from spatially separated context. However, global attention can be computationally expensive and may underrepresent local structure, so the Swin Transformer [13] introduced a windowed attention scheme with shifted windows to limit complexity while enabling cross-window information exchange; this design produces a hierarchical feature pyramid resembling convolutional backbones and makes Transformer architectures practical for dense prediction tasks. Building on these ideas, SSS-Former [14] explicitly reinforces spatial topological relationships and mitigates semantic inconsistencies across scales, thereby combining convolutional locality with Transformer-level global reasoning. Together, these developments allow attention-based models to capture broad contextual information without sacrificing local detail or efficiency, a balance that is especially valuable for remote sensing segmentation, where disambiguating spectrally similar classes, preserving fine boundaries, and leveraging scene-level context are all required simultaneously.
Compared to natural images, remote sensing images exhibit more complex spatial structures and greater variations in object appearances. Figure 1 illustrates the main challenges in remote-sensing semantic segmentation: intra-class inconsistency, inter-class similarity, and difficulty in boundary delineation. CNN- and Transformer-based networks can partially alleviate these issues by enhancing semantic representation and contextual perception. However, most of these methods still rely primarily on spatial feature segmentation and struggle to preserve fine-grained edge structures due to pooling, downsampling, or global attention smoothing. In complex remote sensing scenes, where object boundaries are intricate and highly intertwined, such limitations become more pronounced—particularly in edge-rich regions. Therefore, integrating explicit edge information into the network is essential for achieving accurate and detailed segmentation.
Edge information plays a crucial role in remote sensing semantic segmentation: as a structural cue it enables precise boundary localization and complements region-level semantic predictions. Recent works have explored explicit edge modeling in different ways. BaAFN [15] injects boundary priors into the feature fusion process via boundary-aware attention to sharpen contours, while MSEONet [16] optimizes multi-scale edge maps to guide segmentation and reduce inter-object confusion, though its fusion is restricted to the decoder stage. Guan et al. [17] apply edge-guided deformable convolutions together with point-set contour representation to tightly align oriented boundaries in SAR imagery, a strategy that is effective for oriented objects but less general for complex land-cover types. Overall, most existing approaches treat edge cues as auxiliary signals or fuse them late in the pipeline, which limits deep semantic–edge interaction and weakens multi-scale and multi-directional continuity; given the intricate spatial structures and spectral similarities in remote sensing data, this often leads to blurred boundaries and misclassification in overlapping regions.
Inspired by the above discussion, this paper proposes EFMANet to effectively integrate edge information into the segmentation process while preserving rich spatial semantics in remote sensing images. The network adopts a dual-branch architecture that fuses boundary cues during the encoding stage and enhances feature representation through attention-guided fusion. Specifically, the Edge Fusion Module (EFM) focuses on extracting fine-grained structural and boundary details from the input, providing explicit contour guidance for the encoder. Meanwhile, the Multi-Dimensional Collaborative Fusion Attention (MCFA) module is designed to adaptively integrate spatial and edge-aware representations across multiple channels, ensuring complementary information exchange between the two branches. By allowing EFM to supply boundary-sensitive priors and MCFA to perform global-local feature alignment, the network achieves precise object localization and smoother boundary delineation. This cooperative mechanism enables EFMANet to retain edge fidelity while enhancing semantic discrimination, thereby improving segmentation consistency in complex and cluttered remote sensing scenes. Overall, the integration of EFM and MCFA achieves a balanced enhancement of spatial structure, semantic coherence, and boundary precision. The contributions are summarized in three aspects as follows:
1. The Edge Fusion Module (EFM) performs multimodal attention fusion on downsampled edge features, enhancing boundary information while preserving object shapes. This integration of edge details allows the model to better capture structural features, thus improving segmentation accuracy.
2. The Multi-Dimensional Collaborative Fusion Attention (MCFA) Module fuses contextual information by aggregating spatial and channel features. It integrates both global and local information, ensuring a comprehensive representation of spatial and channel attributes. Through this fusion mechanism, MCFA strengthens feature interactions across multiple dimensions, optimizing the model’s ability to extract relevant information.
3. The Edge Fusion Multidimensional Attention Network (EFMANet) combines the above components in a dual-branch structure for multimodal information fusion with a boundary-aware approach. The first stage focuses on spatial extraction and multimodal feature fusion, while the second stage uses multidimensional attention to aggregate spatial and channel information. Additionally, fusion attention is applied to integrate global and local information, enabling the model to capture complete contextual details. Experimental results demonstrate the architecture’s effectiveness, showing superior performance on three mainstream remote sensing segmentation datasets.

2. Related Work

2.1. Single Modal Semantic Segmentation

FCN [6] introduced an end-to-end pixel-wise framework, enabling direct learning of spatially dense predictions. Despite its pioneering role, FCN suffers from limited receptive fields and blurred object boundaries. UNet [18] addressed some of these issues with an encoder-decoder structure: the encoder extracts multi-scale features via progressive downsampling, and the decoder gradually restores spatial resolution to capture contextual information. SegNet [7] introduced a similar encoder-decoder design but leveraged pooling indices for efficient upsampling to maintain computational efficiency. Transformer-based approaches capture long-range dependencies and global context. UNetFormer [19] combines UNet and Transformers, while hierarchical transformers like the Swin Transformer [13] have shown strong performance in capturing multi-scale contextual information for accurate segmentation in remote sensing applications.

2.2. Multimodal Semantic Segmentation

A modality denotes a specific form of information representation, and multimodal learning aims to jointly reason over two or more complementary sources. Early works mainly focused on how and where to fuse. FuseNet [20] adopts an early-fusion dual-branch design: features from RGB and DSM streams are extracted with parallel convolutions and then aggregated via element-wise addition, enabling shallow cross-modal interaction but potentially mixing noise at low levels. To mitigate premature fusion, vFuseNet [21] follows a late-fusion paradigm that aggregates multi-scale representations near the decoder, which preserves modality-specific cues longer and improves robustness to misalignment. Moving beyond pure CNN fusion, CMFNet [22] pioneers the use of transformers for multimodal remote sensing, introducing multi-scale cross-modal attention to exchange information between RGB and DSM while adding residual links from fused tokens to the decoder to stabilize optimization. Although not designed specifically for multimodality, BiSeNet [23] is frequently adopted as a backbone due to its two-path design—one path maintains high-resolution spatial details while the other enlarges the receptive field—making it a strong base for fusing geometry with appearance. Recent studies refine fusion granularity and alignment. TMFNet [24] proposes a transformer-based multi-modal fusion network that explicitly injects height tokens from DSM into multi-level RGB features, using cross-attention to condition appearance cues on elevation and thus sharpening building/terrain boundaries. FTransUNet [25] introduces a multilevel fusion backbone that hybridizes CNN for locality with ViT-style token mixing for long-range context, performing progressive RGB and DSM fusion from shallow to deep stages to balance local geometry and global semantics.
In this work, we propose using a dual-branch network to enhance boundary information extraction and fusion, better utilizing boundary details in multimodal data.

2.3. Boundary Enhancement

Accurate delineation of object borders is crucial in remote sensing, where adjacent categories often exhibit similar textures. Classical edge operators such as Sobel [26] and Laplacian [27] provide gradient-based priors that can be injected into modern networks as auxiliary channels or supervision. SEANet [28] employs an edge-aware loss function to enhance boundary accuracy, addressing the challenge of precise boundary delineation in complex scenes. BEDSN [29] uses a dedicated edge detection branch to explicitly compensate for boundary loss and fuses it with the semantic segmentation branch via a coupled encoder and multilevel fusion module. BGSNet [30] employs a boundary-guided Siamese multitask architecture, where a dedicated boundary branch supervises edge learning and guides the segmentation branch to better preserve object contours. DBBANet [31] employs a dual-branch design combining spatial and boundary-aware branches, explicitly enhancing edge features to improve fine-grained farmland segmentation. BaAFN [15] integrates boundary-aware attention into feature fusion, emphasizing edge regions to refine segmentation boundaries while maintaining semantic consistency.
In this work, we compute Sobel edges from the input and inject them into the network as an auxiliary signal. The edge stream informs the fusion and decoding stages, alleviating blurry transitions and improving class separability at boundaries—especially for thin structures and densely packed small objects.

2.4. Attention Mechanisms

Attention, inspired by human visual selection, weights informative cues and suppresses distractions to focus representations. UNetFormer [19] implements complementary global and local attention: global tokens provide scene context, while local attention preserves fine edge details and related information. CMTFNet [32] injects multi-scale and channel priors into self-attention so the model is both scale-aware and spectrally discriminative, improving separation across varying object sizes. MSGCNet [33] replaces plain skip connections with windowed cross-attention between encoder and decoder features, enabling hierarchical multi-scale fusion with controlled computation. These approaches motivate our modality-aware multi-dimensional attention that balances boundary preservation and scene-level coherence.
In this work, we adopt a multi-dimensional attention design that couples spatial and channel attention. Spatial attention highlights edge-localized regions and long-range dependencies across large rooftops/roads, while channel attention emphasizes modality-complementary cues, resulting in sharper boundaries and stronger feature mapping throughout the decoder.

3. Methodology

In this section, we provide a detailed description of the overall structure and components of EFMANet. We first introduce the Sobel operator for edge detection and the general architecture of the network, then discuss the EFM module for edge fusion, and finally cover the structure and principles of the MCFA module and the loss function.

3.1. Sobel Operator

In this section, we introduce the Gaussian blur and the Sobel operator used for edge extraction. Applying a Gaussian filter first suppresses high-frequency noise and smooths small texture variations, which prevents spurious gradients from dominating subsequent edge estimates and yields more stable and coherent edge maps. After smoothing, the Sobel operator, composed of two small convolutional kernels oriented along the horizontal and vertical axes, computes approximate first-order derivatives and produces gradient maps that emphasize intensity transitions. Combining these directional gradients into a single magnitude map yields an orientation-agnostic edge strength that is robust to local perturbations. The whole pipeline therefore balances local gradient sensitivity with noise robustness and the slightly more global context introduced by the smoothing kernel. This can be represented as:
$$X_{\mathrm{blurred}}(x, y) = \frac{1}{2\pi\sigma^{2}} \sum_{i,j} X(x+i, y+j)\, e^{-\frac{i^{2}+j^{2}}{2\sigma^{2}}}, \qquad G_x = X_{\mathrm{blurred}} * K_x, \qquad G_y = X_{\mathrm{blurred}} * K_y, \qquad E = \sqrt{G_x^{2} + G_y^{2}}$$
where $X(x+i, y+j)$ is the pixel value at position $(x+i, y+j)$ in the image, $\sigma$ is the standard deviation of the Gaussian distribution, which controls the degree of blurring, and $i, j$ are the offsets within the kernel window that define the weight distribution. $K_x$ and $K_y$ are the Sobel kernels used to calculate the gradients, $G_x$ and $G_y$ are the gradient maps of the image in the horizontal and vertical directions, and $E$ is the combined edge-intensity map.
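A minimal PyTorch sketch of this pipeline (Gaussian smoothing followed by Sobel gradients and a magnitude map) is given below; the 5 × 5 kernel size and the $\sigma$ value are illustrative choices rather than settings reported in this paper.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> torch.Tensor:
    # Build a normalized 2D Gaussian kernel of shape (1, 1, size, size).
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    yy, xx = torch.meshgrid(ax, ax, indexing="ij")
    g = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return (g / g.sum()).view(1, 1, size, size)

def sobel_edge_map(x: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # x: grayscale image tensor of shape (B, 1, H, W).
    # 1) Gaussian blur to suppress high-frequency noise.
    x_blur = F.conv2d(x, gaussian_kernel(5, sigma), padding=2)
    # 2) Sobel gradients along the horizontal and vertical axes.
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(x_blur, kx, padding=1)
    gy = F.conv2d(x_blur, ky, padding=1)
    # 3) Orientation-agnostic edge magnitude E = sqrt(Gx^2 + Gy^2).
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

# Usage: convert an RGB tensor to grayscale, then extract the edge map.
rgb = torch.rand(1, 3, 256, 256)
edges = sobel_edge_map(rgb.mean(dim=1, keepdim=True))
```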

3.2. Overall Architecture

The main architecture of EFMANet is a two-stage network. In the first stage, semantic and multi-scale edge gradient features are progressively extracted through the backbone and CNN-based networks, enabling rich spatial and contextual information acquisition. In the second stage, a Multi-Dimensional Collaborative Attention is employed to extract structural features from both horizontal and vertical perspectives. This is especially crucial for segmenting objects with strong spatial characteristics and inter-object relationships in remote sensing imagery. It enhances the model’s spatial awareness and structural understanding. Finally, a Fusion Attention module integrates the previously obtained global and local information across different scales, helping the model better capture the semantic associations among objects of varying sizes.
The overall architecture of EFMANet is illustrated in Figure 2, which is designed as a two-stage edge-guided segmentation framework emphasizing edge-aware feature fusion and multidimensional attention modeling. In the first stage, the Edge-Fusion Encoder, the network adopts a dual-branch feature extraction mechanism: the first branch utilizes ConvNeXt to extract hierarchical semantic features from the original image, while the second branch generates an edge-enhanced image using the Sobel operator and extracts complementary boundary representations through a lightweight CNN. The features from both branches are fused via the EFM, where multi-scale fused features from the last three encoder stages are aligned to a uniform resolution through convolution and merged to guide the network in capturing edge cues across multiple scales, thereby preserving gradient continuity and improving boundary consistency. The second stage, the Feature Enhancement and Fusion Decoder, focuses on global-local interaction through hierarchical attention modeling. Specifically, the MSA module enhances the fused features by modeling interactions between the horizontal and vertical dimensions, while the MDFA module integrates the enhanced contextual features with the original encoder representations, enabling the model to learn complementary spatial dependencies and structural relationships among objects. Through this combination of edge-guided feature fusion and multidimensional attention modeling, EFMANet achieves effective joint learning of local edge precision and global contextual understanding, leading to more accurate and coherent segmentation performance.
For the encoder part, we use $X$ and $Y$ to represent the input image and its corresponding edge-detected image, respectively. The first branch of the encoder extracts hierarchical features from the input image, producing feature maps at four scales: $X_1 \in \mathbb{R}^{C \times H/4 \times W/4}$, $X_2 \in \mathbb{R}^{2C \times H/8 \times W/8}$, $X_3 \in \mathbb{R}^{4C \times H/16 \times W/16}$, and $X_4 \in \mathbb{R}^{8C \times H/32 \times W/32}$. The second branch first applies the Sobel operator to the original image to generate the edge-detected image. This edge image is then processed by a CNN encoder, which further extracts multi-scale edge features. The downsampled feature maps have dimensions $C_i \times \frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}$, where $C_i$ is the number of channels in the $i$-th CNN encoder layer. Each layer's input feature undergoes a $3 \times 3$ convolution, followed by a ReLU activation function and batch normalization, and finally passes through another $3 \times 3$ convolution for downsampling. The above CNN encoder can be represented as:
$$\tilde{Y}_i = \mathrm{Conv}_{3\times3}\big(\mathrm{BN}(\mathrm{Sigmoid}(\mathrm{Conv}_{3\times3}(Y_i)))\big), \quad i = 1, 2, 3, 4$$
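A sketch of one stage of this edge-branch encoder is shown below; the channel widths and the stride-2 downsampling are assumptions, and the Sigmoid activation follows the formula above (the surrounding text mentions ReLU instead).

```python
import torch
import torch.nn as nn

class EdgeEncoderStage(nn.Module):
    """One stage of the edge-branch CNN encoder, mirroring
    Y_i -> Conv3x3(BN(Sigmoid(Conv3x3(Y_i)))) with a strided conv for downsampling."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.Sigmoid()          # the formula uses Sigmoid; the text mentions ReLU
        self.bn = nn.BatchNorm2d(out_ch)
        self.down = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=2, padding=1)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        return self.down(self.bn(self.act(self.conv1(y))))

# Usage: four stages producing multi-scale edge features (channel widths are illustrative).
stages = nn.ModuleList([EdgeEncoderStage(c_in, c_out)
                        for c_in, c_out in [(1, 32), (32, 64), (64, 128), (128, 256)]])
x, feats = torch.rand(1, 1, 256, 256), []
for stage in stages:
    x = stage(x)
    feats.append(x)   # progressively downsampled edge features
```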
Subsequently, the feature maps of corresponding scales from the first and second branches are fused through the EFM module, which can be represented as:
$$\tilde{X}_i = \mathrm{EFM}(X_i, \tilde{Y}_i), \quad i = 1, 2, 3, 4$$
$$F_x = \mathrm{Cat}\big(\mathrm{Conv}_{1\times1}(\tilde{X}_2),\ \mathrm{Conv}_{1\times1}(\tilde{X}_3),\ \mathrm{Conv}_{1\times1}(\tilde{X}_4)\big)$$
where $\mathrm{EFM}(\cdot)$ represents the EFM block used for the fusion of image features and edge features, $X_i$ represents the RGB encoder features, $\tilde{Y}_i$ represents the edge-detection features from the CNN encoder, and $\tilde{X}_i$ represents the fused feature. $\mathrm{Conv}_{1\times1}$ indicates a $1 \times 1$ convolution, $\mathrm{Cat}$ represents the concatenation operation, and $F_x$ represents the fused features of $\tilde{X}_2$, $\tilde{X}_3$, and $\tilde{X}_4$.
Then, $F_x$ is used as the input feature for the second stage. After exchanging the dimensions of $F_x$, it is processed through the MCFA. The horizontally and vertically enhanced features obtained from MSA are used as the key (K) and value (V) inputs of the corresponding attention mechanism, while the fused feature $F_x$ serves as the query (Q). These inputs are fed into the MDFA module to integrate the original features with the horizontal and vertical contextual relationships. Finally, the fused attention features are combined with $F_x$ and passed through the segmentation head to obtain the final segmentation result. This can be expressed as follows:
$$F_h = \mathrm{MSA}(P_h(F_x)), \quad F_w = \mathrm{MSA}(P_w(F_x))$$
$$F_m = \mathrm{MDFA}(F_h, F_w), \quad F_{\mathrm{out}} = \mathrm{Cat}(F_m, F_x)$$
where $\mathrm{MSA}(\cdot)$ is a module designed to enhance the corresponding channels of the input feature. $P_h$ converts the dimensions of $F_x$ from $(B, C, H, W)$ to $(B, H, C, W)$, while $P_w$ converts the dimensions of $F_x$ from $(B, C, H, W)$ to $(B, W, H, C)$, adapting to the matching branch in the MSA module. MDFA is a hybrid attention module, where $F_h$ represents the vertically enhanced features processed by the MSA module, $F_w$ represents the horizontally enhanced features, and $F_m$ represents the result of MDFA. $F_{\mathrm{out}}$ denotes the final fused features used for the segmentation output.

3.3. Edge Fusion Module

To effectively integrate the features generated by downsampling the original image through the encoder with the corresponding edge features extracted from the CNN-based edge encoder, we propose the Edge Fusion Module (EFM), as illustrated in Figure 3. In the EFM, the dual-branch design enables complementary feature interaction between texture and edge representations. Specifically, the ordinary image branch captures multi-scale semantic context using parallel 3 × 3 and 5 × 5 convolutions, whose outputs are fused to enhance feature diversity. The edge branch focuses on fine-grained boundary cues by employing 1 × 1 and 3 × 3 convolutions, followed by element-wise summation to reinforce edge responses. Both branches are subsequently passed through Global Average Pooling (GAP) and a convolution layer, then activated by ReLU and normalized using Sigmoid to generate adaptive attention weights. Finally, these attention-modulated features are recalibrated and multiplied with the original inputs, followed by fusion to produce the final representation. This design allows EFM to adaptively balance semantic and boundary information across scales, achieving effective texture-edge synergy for enhanced segmentation accuracy, where $C_i$ denotes the number of channels at each stage.
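A compact sketch of the EFM along the lines described above is given below; the layer widths, the per-branch attention heads, and the additive fusion at the end are assumptions based on the textual description rather than the exact implementation.

```python
import torch
import torch.nn as nn

class EFM(nn.Module):
    """Sketch of the Edge Fusion Module: an image branch (3x3 + 5x5 convs), an edge
    branch (1x1 + 3x3 convs), GAP-based attention weights, and recalibrated fusion."""

    def __init__(self, channels: int):
        super().__init__()
        self.img3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.img5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.edge1 = nn.Conv2d(channels, channels, 1)
        self.edge3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.att_img = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True))
        self.att_edge = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True))

    def forward(self, x_img: torch.Tensor, x_edge: torch.Tensor) -> torch.Tensor:
        f_img = self.img3(x_img) + self.img5(x_img)       # multi-scale semantic context
        f_edge = self.edge1(x_edge) + self.edge3(x_edge)  # reinforced edge responses
        w_img = torch.sigmoid(self.att_img(self.gap(f_img)))    # adaptive channel weights
        w_edge = torch.sigmoid(self.att_edge(self.gap(f_edge)))
        # Recalibrate each input with its attention weights, then fuse (addition assumed).
        return x_img * w_img + x_edge * w_edge

# Usage with matching-scale image and edge features
efm = EFM(64)
fused = efm(torch.rand(1, 64, 64, 64), torch.rand(1, 64, 64, 64))
```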

3.4. Multi-Dimensional Collaborative Fusion Attention

3.4.1. Multidimensional Spatial Attention

We design Multidimensional Spatial Attention (MSA) to integrate channel and spatial features, as illustrated in Figure 4. This module consists of two similar branches, which enhance features in the horizontal and vertical directions. Specifically, let $F_x \in \mathbb{R}^{C \times H \times W}$ be the feature map obtained after fusing the last three stages of the encoder. For the vertical branch, the dimensions of $F_x$ are first rearranged to obtain $F_x^h \in \mathbb{R}^{H \times C \times W}$. Then, $F_x^h$ is processed by AvgPool and StdPool to aggregate the features into $F_p^h \in \mathbb{R}^{H \times 1 \times 1}$, followed by a dimension transformation to $\mathbb{R}^{1 \times 1 \times H}$. This process can be formulated as:
$$F_x^h = P_h(F_x), \qquad F_p^h = \frac{1}{2}\big(\mathrm{AvgPool}(F_x^h) + \mathrm{StdPool}(F_x^h)\big)$$
where $P_h$ is responsible for transforming the dimensions of $F_x \in \mathbb{R}^{C \times H \times W}$ into $F_x^h \in \mathbb{R}^{H \times C \times W}$. Additionally, AvgPool represents adaptive average pooling, while StdPool refers to standard-deviation pooling.
Next, a $1 \times 3$ convolution and a linear layer are used to further extract spatial and channel-wise information. A Sigmoid activation generates the corresponding weights, which are applied to the outputs $F_c$ and $F_l$. These weighted features are then summed to obtain $F_m \in \mathbb{R}^{H \times 1 \times 1}$. Finally, $F_m$ is passed through a Sigmoid activation and multiplied element-wise with $F_x^h$ to produce the vertically enhanced feature $F_h$. This process can be formulated as:
$$F_c = P_1^{-1}\big(\mathrm{Conv}_{1\times3}(P_1(F_p^h))\big), \qquad F_l = P_1^{-1}\big(\mathrm{Linear}(P_1(F_p^h))\big)$$
$$F_m = F_c \otimes \sigma(F_c) + F_l \otimes \sigma(F_l), \qquad F_h = \sigma(F_m) \otimes F_x^h$$
where $F_c$ represents the feature processed by the convolution, and $F_l$ denotes the feature processed by the linear layer. $P_1$ is the operation that swaps the dimensions of $F_p^h \in \mathbb{R}^{H \times 1 \times 1}$ to $\mathbb{R}^{1 \times 1 \times H}$, and $P_1^{-1}$ restores the dimensions so that $F_c, F_l \in \mathbb{R}^{H \times 1 \times 1}$. $\mathrm{Conv}_{1\times3}$ indicates a $1 \times 3$ convolution, $\mathrm{Linear}$ represents the linear layer, and $\sigma$ denotes the Sigmoid activation function.
Similarly, for the horizontal direction, we swap the dimensions of the fused encoder output feature to $F_x^w \in \mathbb{R}^{W \times H \times C}$, then apply the same AvgPool and StdPool to obtain the aggregated feature $F_p^w$. Afterward, we perform further feature extraction using a $1 \times 3$ convolution and a linear layer. After the features are fused using a Sigmoid, the result is expanded to match the original input feature size and multiplied with it to obtain the enhanced feature $F_w$. This can be expressed as:
$$F_x^w = P_w(F_x), \qquad F_p^w = \frac{1}{2}\big(\mathrm{AvgPool}(F_x^w) + \mathrm{StdPool}(F_x^w)\big)$$
where $P_w$ represents the operation of converting the dimensions of $F_x \in \mathbb{R}^{C \times H \times W}$ to $F_x^w \in \mathbb{R}^{W \times H \times C}$, AvgPool refers to adaptive average pooling, and StdPool refers to standard-deviation pooling.
$$F_c = P_2^{-1}\big(\mathrm{Conv}_{1\times3}(P_2(F_p^w))\big), \qquad F_l = P_2^{-1}\big(\mathrm{Linear}(P_2(F_p^w))\big)$$
$$F_m = F_c \otimes \sigma(F_c) + F_l \otimes \sigma(F_l), \qquad F_w = \sigma(F_m) \otimes F_x^w$$
where $\sigma$ denotes the Sigmoid activation function, $F_c$ represents the feature processed by the convolution, and $F_l$ denotes the feature processed by the linear layer. $P_2$ is the operation that swaps the dimensions of $F_p^w \in \mathbb{R}^{W \times 1 \times 1}$ to $\mathbb{R}^{1 \times 1 \times W}$, and $P_2^{-1}$ restores the dimensions so that $F_c, F_l \in \mathbb{R}^{W \times 1 \times 1}$. $\mathrm{Conv}_{1\times3}$ indicates a $1 \times 3$ convolution, and $\mathrm{Linear}$ represents the linear layer.
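A simplified sketch of one MSA branch is given below; the permutation, the pooling over the remaining axes, and the gating arrangement follow the equations above, while the exact kernel and layer configuration is an assumption.

```python
import torch
import torch.nn as nn

class MSABranch(nn.Module):
    """One MSA branch: permute so the target axis (H or W) leads, pool with
    0.5*(avg + std), apply a 1x3 conv and a linear layer with sigmoid gating,
    and reweight the permuted feature map."""

    def __init__(self, length: int, dim: str = "h"):
        super().__init__()
        self.dim = dim
        self.conv = nn.Conv1d(1, 1, kernel_size=3, padding=1)   # the 1x3 convolution
        self.linear = nn.Linear(length, length)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, C, H, W) -> (B, H, C, W) for the vertical branch, (B, W, H, C) for the horizontal.
        fx = f.permute(0, 2, 1, 3) if self.dim == "h" else f.permute(0, 3, 2, 1)
        b, l = fx.shape[0], fx.shape[1]
        flat = fx.reshape(b, l, -1)
        fp = 0.5 * (flat.mean(-1) + flat.std(-1))               # (B, L) pooled descriptor
        fc = self.conv(fp.unsqueeze(1)).squeeze(1)              # convolution path
        fl = self.linear(fp)                                    # linear path
        fm = fc * torch.sigmoid(fc) + fl * torch.sigmoid(fl)    # gated sum
        return fx * torch.sigmoid(fm).view(b, l, 1, 1)          # enhanced F_h or F_w

# Usage: vertical (H) and horizontal (W) enhancement of the fused encoder feature.
f_x = torch.rand(2, 64, 32, 32)
f_h = MSABranch(length=32, dim="h")(f_x)   # shape (B, H, C, W)
f_w = MSABranch(length=32, dim="w")(f_x)   # shape (B, W, H, C)
```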

3.4.2. Multi-Directional Fusion Attention Module

To fuse the enhanced features, we design the MDFA attention, as shown in Figure 5. Specifically, we take the fused encoder feature map as Q and the vertically enhanced feature map as K and V to compute vertical attention; we then take the horizontally enhanced feature map as K and V to compute horizontal attention. Finally, the vertical and horizontal attention components are summed and incorporated into the original input feature map via a residual connection, producing the final output. This process can be expressed as:
$$(Q_h, K_h, V_h) = (F_x, F_h, F_h), \qquad (Q_w, K_w, V_w) = (F_x, F_w, F_w)$$
$$\mathrm{Attn}_h = \mathrm{Softmax}\!\left(\frac{Q_h K_h^{T}}{\sqrt{d_k}}\right) V_h, \qquad \mathrm{Attn}_w = \mathrm{Softmax}\!\left(\frac{Q_w K_w^{T}}{\sqrt{d_k}}\right) V_w$$
$$F_{\mathrm{out}} = \mathrm{BN}\big(\lambda(\mathrm{Attn}_h + \mathrm{Attn}_w) + F_x\big)$$
where $F_h$ and $F_w$ are the vertically and horizontally enhanced features after MSA, respectively; $F_x$ is the fused feature from the first stage; $\mathrm{Attn}_h$ and $\mathrm{Attn}_w$ denote the directional attention outputs; $(\cdot)^{T}$ indicates matrix transposition; $d_k$ is the key vector dimension, with its square root serving as a scaling factor for stable attention scores; and $\lambda$ is a learnable parameter.
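A simplified sketch of this fusion is shown below; flattening all maps to (B, N, C) token sequences and using the channel count as $d_k$ are assumptions made for clarity, not the exact implementation.

```python
import torch
import torch.nn as nn

class MDFA(nn.Module):
    """Sketch of MDFA: cross-attention with Q from the fused encoder feature and K/V from
    the direction-enhanced features, summed, scaled by a learnable lambda, and added back
    through a residual connection followed by batch normalization."""

    def __init__(self, channels: int):
        super().__init__()
        self.scale = channels ** -0.5            # 1/sqrt(d_k), with d_k = channels assumed
        self.lam = nn.Parameter(torch.ones(1))   # learnable weight lambda
        self.norm = nn.BatchNorm2d(channels)

    @staticmethod
    def cross_attn(q, k, v, scale):
        attn = torch.softmax(q @ k.transpose(1, 2) * scale, dim=-1)
        return attn @ v

    def forward(self, f_x, f_h, f_w):
        # f_x, f_h, f_w: (B, C, H, W), assuming the enhanced features are reshaped back.
        b, c, h, w = f_x.shape
        q = f_x.flatten(2).transpose(1, 2)       # (B, N, C) token sequence
        kh = f_h.flatten(2).transpose(1, 2)
        kw = f_w.flatten(2).transpose(1, 2)
        attn_h = self.cross_attn(q, kh, kh, self.scale)
        attn_w = self.cross_attn(q, kw, kw, self.scale)
        fused = self.lam * (attn_h + attn_w)
        out = fused.transpose(1, 2).reshape(b, c, h, w) + f_x   # residual connection
        return self.norm(out)

# Usage
mdfa = MDFA(64)
out = mdfa(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
```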

3.5. Loss Function

The loss function employed in this paper integrates the Dice loss $\mathcal{L}_{\mathrm{dice}}$ and the cross-entropy loss $\mathcal{L}_{\mathrm{ce}}$, formulated as follows:
$$\mathcal{L}_{\mathrm{ce}} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} y_k^{(n)} \log \hat{y}_k^{(n)}$$
$$\mathcal{L}_{\mathrm{dice}} = 1 - \frac{2}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} \frac{\hat{y}_k^{(n)}\, y_k^{(n)}}{\hat{y}_k^{(n)} + y_k^{(n)}}$$
$$\mathcal{L} = \mathcal{L}_{\mathrm{ce}} + \mathcal{L}_{\mathrm{dice}}$$
where $N$ and $K$ denote the number of samples and classes, respectively. $y_k^{(n)}$ represents the ground-truth semantic label in one-hot encoding, while $\hat{y}_k^{(n)}$ corresponds to the softmax output of the network, i.e., the predicted confidence that sample $n$ belongs to class $k$.
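A minimal sketch of this combined objective is shown below; the small smoothing constant in the Dice denominator and the per-class averaging are assumptions for numerical stability, not details reported in the paper.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits: torch.Tensor, target: torch.Tensor, num_classes: int,
                  eps: float = 1e-6) -> torch.Tensor:
    """L = L_ce + L_dice for logits of shape (B, K, H, W) and integer labels (B, H, W)."""
    ce = F.cross_entropy(logits, target)                      # cross-entropy term
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(0, 2, 3))              # soft intersection per class
    denom = (probs + one_hot).sum(dim=(0, 2, 3))
    dice = 1.0 - (2.0 * inter / (denom + eps)).mean()         # soft Dice term
    return ce + dice

# Usage
logits = torch.randn(2, 6, 64, 64)
labels = torch.randint(0, 6, (2, 64, 64))
loss = combined_loss(logits, labels, num_classes=6)
```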

4. Experimental and Analysis

This section introduces the experimental design and analysis of the proposed model, including comparative evaluations against other approaches and ablation studies to demonstrate the effectiveness of each module.

4.1. Datasets

Vaihingen Dataset: This dataset, from the ISPRS 2D Semantic Labeling Challenge, was captured over the city of Vaihingen, Germany. Each image tile has a spatial resolution of 2494 × 2064 pixels. The dataset includes six semantic categories: impervious surfaces, buildings, low vegetation, trees, cars, and an additional background class. For our experiments, only the infrared-red-green TOP images are utilized. Following the official dataset split defined by the ISPRS benchmark, we divide the available images into training and validation subsets. Each image tile is further cropped into 1024 × 1024 patches to standardize the input size for model training and evaluation.
Potsdam Dataset: This dataset was captured over the urban region of Potsdam, Germany. Each tile has dimensions of 6000 × 6000 pixels, offering highly detailed spatial information. The semantic annotation includes six classes following the same labeling scheme as the Vaihingen dataset. In accordance with the official split provided by the ISPRS benchmark, 22 image tiles (excluding 7_10) are allocated for training, while the remaining 15 tiles are designated for testing. For model input standardization, all tiles are divided into 1024 × 1024 patches. In specific evaluation settings, the clutter category is excluded to focus on core land cover classification performance.
LoveDA Dataset: This dataset comprises 5987 images, each with a dimension of 1024 × 1024 pixels. The imagery is collected from three representative Chinese cities—Nanjing, Changzhou, and Wuhan—capturing diverse environmental contexts. The dataset is annotated with seven land cover categories and is partitioned into 2522 images for training, 1669 for validation, and 1796 for testing. Notably, LoveDA introduces significant domain shifts between rural and urban scenes, characterized by varying object scales, intricate background textures, and severe class imbalance, making it a valuable resource for evaluating domain-adaptive semantic segmentation algorithms.

4.2. Experimental Details

The experimental environment consisted of Ubuntu 24.04 and PyTorch 2.2 running on a single NVIDIA A40 GPU. Models were optimized with AdamW at an initial learning rate of $6 \times 10^{-4}$ and trained under a cosine learning-rate scheduler. Training patches of size 1024 × 1024 were sampled from the Vaihingen, Potsdam, and LoveDA datasets, and on-the-fly augmentations (random scaling, vertical/horizontal flips, and rotation) were applied. During inference, multi-scale testing with random flips was employed to enhance performance stability.
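A short sketch of this optimization setup is given below; the weight decay and epoch count are illustrative assumptions, as they are not reported here.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Conv2d(3, 6, 3, padding=1)           # stand-in for the segmentation network
optimizer = AdamW(model.parameters(), lr=6e-4, weight_decay=1e-2)  # initial lr 6e-4
scheduler = CosineAnnealingLR(optimizer, T_max=100)   # cosine learning-rate decay

for epoch in range(100):
    # ... per-batch forward pass, loss computation, and loss.backward() go here ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()                                  # decay the learning rate each epoch
```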

4.3. Evaluation Metrics

In this experiment, we assess the performance of the model using widely adopted metrics in remote sensing semantic segmentation, including Mean Intersection over Union (mIoU), Overall Accuracy (OA), and F1 score. These metrics quantify both class-wise and overall prediction quality as follows:
$$\mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i + FN_i}$$
$$\mathrm{OA} = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} (TP_i + FP_i + TN_i + FN_i)}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where $TP_i$, $FP_i$, $TN_i$, and $FN_i$ represent the counts of true positives, false positives, true negatives, and false negatives for each category $i$. The mIoU quantifies the average overlap between predicted regions and ground truth across all classes, while OA indicates the fraction of pixels correctly classified over the entire image. Precision measures the accuracy of positive predictions, Recall assesses how completely the model captures relevant pixels, and the F1 score combines both into a harmonic mean, providing a balanced assessment of segmentation quality.
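A small sketch that computes these metrics from a class confusion matrix is given below; it uses the standard overall-accuracy definition (correctly classified pixels over all pixels) and a small constant to avoid division by zero, both of which are assumptions of the sketch.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray, eps: float = 1e-10):
    """mIoU, OA, and mean F1 from a KxK confusion matrix (rows: ground truth, cols: prediction)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    oa = tp.sum() / conf.sum()            # correctly classified pixels over all pixels
    return iou.mean(), oa, f1.mean()

# Usage with a toy 3-class confusion matrix
conf = np.array([[50, 2, 3], [4, 40, 1], [2, 3, 45]])
miou, oa, mean_f1 = segmentation_metrics(conf)
```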
We also assess computational cost by reporting FLOPs and the total number of parameters, based on an input size of 3 × 1024 × 1024 , to allow fair comparison between different methods.

4.4. Comparative Experiments

4.4.1. Comparative Experiments on the Vaihingen Dataset

Table 1 presents the comparison of methods on the Vaihingen dataset. The results demonstrate that, under the same conditions, our proposed EFMANet achieves 92.28%, 85.92%, and 94.00% in terms of average F1-score, mIoU, and OA, respectively. Compared to the second-best method, EFMANet improves the mIoU by 1.01%. Compared to the traditional multi-scale convolutional network PSPNet and the Swin Transformer-based DC-Swin network, EFMANet, which incorporates edge information, exhibits superior segmentation capability, enabling better recognition of objects belonging to the corresponding categories.
To facilitate a more comprehensive comparison, we present the zoomed-in segmentation results of EFMANet alongside other networks, as shown in Figure 6. The first and third rows display the segmentation details for large object boundaries and dense small objects, respectively. The edge map successfully identifies the boundaries of large buildings and the edges of clustered small vehicles. Compared to traditional convolutional networks like ABCNet and multi-scale attention-based networks such as MSGCNet, EFMANet demonstrates superior edge detail segmentation in complex scenarios. The second row shows that, for buildings adjacent to shadows, the edge detection map effectively distinguishes the buildings from the shadowed regions, whereas other networks misidentify the shadow as part of the building’s extension. Moreover, EFMANet, which integrates edge information in the encoder part, outperforms the boundary-guided MCCANet and frequency-domain-based SFFNet in recognition. Meanwhile, these methods significantly surpass traditional purely spatial segmentation networks, such as ABCNet, A2-FPN, and MSGCNet, particularly in handling shadowed areas and regions with significant texture variations. Finally, the last row illustrates the improvement in vegetation and tree recognition due to the enhancement of edge contours, especially when inter-class similarities exist while intra-class differences are clear.

4.4.2. Comparative Experiments on the Potsdam Dataset

To further explore the capability of our model, we conducted experiments on the larger and more complex Potsdam dataset. The results are presented in Table 2. On the Potsdam dataset, EFMANet continues to exhibit outstanding performance, achieving F1, mIoU, and OA scores of 93.52%, 88.03%, and 92.53%. Compared to the second-best method, EFMANet improves mIoU by 0.52%. While surpassing purely spatial segmentation networks, EFMANet also outperforms MCCANet, which incorporates edge segmentation, and SFFNet, which leverages frequency-domain information, demonstrating its superior segmentation capability.
To better visualize the differences, we provide magnified segmentation results, as shown in Figure 7. The first row demonstrates accurate edge segmentation of small objects in complex scenarios where small objects are densely clustered. The edge map clearly displays the boundary contours of these objects, allowing the model to precisely segment cluttered edges. The second row shows that edge contours help achieve more accurate object recognition. For instance, in the case of the truck in the image, some networks misidentify it as background or other labels due to the cargo it carries. However, with the integration of edge information, EFMANet’s edge map captures the complete outline of the truck, enabling it to accurately recognize the truck even when affected by the cargo. The third row showcases the edge segmentation of building and tree boundaries, where the edge map precisely distinguishes the boundary between buildings and trees. The fourth row highlights the segmentation of tree and vegetation boundaries, as well as the identification of impervious surfaces that are shadowed. The edge map can effectively identify the boundary of trees and vegetation, as well as accurately detecting the black impervious surfaces in the image. This allows the model to precisely recognize tree vegetation and extract impervious surface roads. In comparison to other networks, EFMANet shows superior performance in handling complex boundaries and shadowed areas, demonstrating better recognition ability, especially in scenarios with inter-class similarities.

4.4.3. Comparative Experiments on the LoveDA Dataset

To further validate the segmentation ability of our model in complex scenarios, we evaluated EFMANet on the LoveDA dataset, with results presented in Table 3. EFMANet achieves an mIoU of 54.48%, surpassing the second-best method by 1.13%. Although it does not reach the best score in every individual category due to dataset complexity, it performs well on buildings, roads, water, and forests, and achieves the best overall mIoU.
Visualization results on LoveDA are shown in Figure 8 which displays the enlarged results of EFMANet on the LoveDA dataset. The first row demonstrates the accurate recognition of water edges and roads in the case of densely clustered and complexly interwoven objects. The edge map clearly shows that the overall contour of the objects is identified, preventing the issue of hollow areas that may appear in other segmentation maps. The second row displays the segmentation of complex building edges. Compared to other models, the introduction of edge information generates more precise building boundaries, enabling EFMANet to still perform accurate building recognition in situations with intertwined edges. The third row shows the accurate recognition of the edges between buildings and surrounding forest in densely overlapping objects. The model can precisely identify the building, forest, and background. The fourth row highlights the segmentation of categories with similar and complex boundaries. Overall, EFMANet demonstrates superior performance in recognizing complex boundaries and improving category-level segmentation accuracy.

4.4.4. Comparative Experiments Using McNemar Test on the Vaihingen and Potsdam Datasets

Table 4 presents the McNemar test results on the Vaihingen and Potsdam datasets, comparing networks categorized into conventional spatial segmentation methods and edge-enhanced approaches. It can be observed that among the traditional spatial segmentation networks, such as PSPNet, DeepLabv3, and ABCNet, EFMANet consistently achieves lower B and C values and smaller χ 2 statistics, indicating fewer misclassifications and better alignment with ground truth. While edge-enhanced methods like MCCA, DCSwin, and MANet improve boundary preservation, they often show larger χ 2 values, reflecting higher variance in predictions. In contrast, the low χ 2 values of EFMANet demonstrate stable and robust performance, highlighting the effectiveness of its edge-fusion encoder and multi-dimensional collaborative fusion attention (MCFA) modules in capturing fine-grained structural details and integrating global-local contextual information. Overall, this comparison emphasizes that EFMANet not only improves segmentation accuracy across both categories but also ensures statistically significant performance gains, particularly in challenging edge-sensitive regions.

4.5. Ablation Experiments

To evaluate the effectiveness of each component in EFMANet, we conducted a series of ablation experiments on the Vaihingen and Potsdam datasets. In these experiments, we sequentially removed the added modules and primarily focused on two performance metrics: mIoU and meanF1.

4.5.1. Ablation Experiments of EFMANet Components

Table 5 presents the results after removing individual modules from EFMANet. Here, EFM represents the Edge Fusion Module, MCFA denotes the Multi-Dimensional Collaborative Fusion Attention Module, MCFA-W refers to the horizontal feature enhancement module, MCFA-H corresponds to the vertical feature enhancement module, and MDFA stands for the Multi-Dimensional Dual-Branch Fusion Attention mechanism. When the EFM module is removed from EFMANet, the corresponding edge branch in the encoder is also eliminated. Similarly, when the horizontal and vertical MCFA branches are removed, the corresponding attention branches in MDFA are also excluded. The removed components in the ablation study are indicated as “(w/o)”. We also conducted ablation experiments on the modules based on the baseline, as shown in Table 6.
Figure 5 illustrates the segmentation results after removing individual modules from EFMANet. The results show a significant performance drop whenever any module is removed. To further verify the effectiveness of each module, we conducted additional experiments by integrating individual modules into the baseline model and evaluating the performance on the Vaihingen dataset. The Baseline directly uses the fourth-layer features output by the encoder.

4.5.2. Effectiveness of Sobel

Table 7 presents a comparison of segmentation performance using different edge detection methods, including Laplacian, Canny, and the proposed Sobel-based approach. All three methods achieve relatively high metrics, but the Sobel operator demonstrates superior performance, with an mIoU of 85.92%, overall accuracy (OA) of 94.00%, and an F1 score of 92.28%. In terms of computational complexity, the Sobel operator is lightweight and efficient, requiring only simple gradient calculations along the horizontal and vertical directions, which is advantageous for large-scale remote sensing images. Regarding spatial information extraction, Sobel effectively preserves edge orientation and fine structural details, enabling the network to accurately capture key object boundaries. In contrast, the Laplacian operator is sensitive to intensity changes but prone to amplifying noise, while Canny involves multi-stage processing, increasing computational overhead. Therefore, Sobel is more suitable in EFMANet.

4.5.3. Effectiveness of EFM

According to the data in Table 5, when the EFM module and its corresponding branch are removed from EFMANet on the Vaihingen dataset, the mIoU and F1 scores decrease by 0.77% and 0.47%. Similarly, after removing the EFM module and its corresponding branch on the Potsdam dataset, the mIoU and F1 scores drop by 0.99% and 0.57%. From Table 6 it can be observed that adding the EFM module and its corresponding encoder branch to the baseline model improves the mIoU by 3.45% and mean F1 by 2.19% on the Vaihingen dataset, with mIoU increasing from 81.59% to 85.04% and mean F1 increasing from 89.55% to 91.74%. Per-class gains are especially notable for small and thin objects. Car F1 increases from 81.63% to 89.29%, an increase of 7.66%, while Impervious Surface, Building, Lowveg and Tree also show consistent improvements.
Figure 9 shows that after adding EFM to the baseline, the recognition accuracy for low vegetation and vehicles has improved. As shown in Figure 9b, after adding the EFM module, the model achieves more precise recognition of low vegetation edges in complex boundary scenarios. Additionally, the edge module enhances the accuracy of recognizing objects in shadowed areas. Figure 10b demonstrates that after incorporating the EFM module, the model exhibits better discrimination of trees, vegetation, and vehicles, particularly for objects with intricate and fine edges. Taken together, the quantitative and qualitative results indicate that the multi-branch EFM effectively captures fine-grained, edge-aware features and materially improves segmentation performance on objects with complex boundaries. These results indicate that each of our individual modules is highly effective.

4.5.4. Effectiveness of Individual EFM on Baseline

To further verify the effectiveness of EFM, we progressively added each stage to the baseline for comparative experiments. The incremental ablation results in Table 8 show that adding EFM1 increases mIoU by 1.37% and mean F1 by 0.88%, with Car achieving the most significant improvement of 4.49 percentage points. After adding EFM2, mIoU further rises by 0.98% and mean F1 by 0.62%. The improvement from EFM3 is relatively smaller, with mIoU and mean F1 increasing by 0.47% and 0.29%. Adding EFM4 refines the results, improving mIoU by 0.63% and mean F1 by 0.40%, where Lowveg shows a notable increase of 2.03%. Overall, the complete multi-branch EFM improves mIoU by 3.45% and mean F1 by 2.19% compared with the baseline. The experimental results demonstrate the effectiveness of the EFM module on the baseline, particularly in enhancing segmentation performance for cars and low vegetation.

4.5.5. Effectiveness of the MCFA

The module consists of horizontal and vertical branches; thus, experiments were conducted for three different scenarios. According to Table 5, when the MCFA module of EFMANet was entirely removed on the Vaihingen dataset, the mIoU dropped by 0.88%. Removing only the vertical branch resulted in an mIoU decrease of 0.77% and an F1 decrease of 0.48%, while removing the horizontal branch led to an mIoU decrease of 0.67% and an F1 decrease of 0.42%. Similarly, on the Potsdam dataset, the removal of the entire MCFA module resulted in an mIoU drop of 0.92%. When the vertical branch was removed, the mIoU and F1 scores decreased by 0.78% and 0.45%, respectively. Removing the horizontal branch led to an mIoU decrease of 0.87% and an F1 decrease of 0.49%. From Table 6, it can be observed that adding the MCFA to the baseline improves the mIoU and F1 scores by 0.69% and 0.45% on the Vaihingen dataset. Additionally, we separately added the MCFA vertical branch and the MCFA horizontal branch to the baseline model; they improved the mIoU by 0.35% and 0.43%, respectively. We also conducted experiments on MDFA within the MCFA. As shown in Table 5, after removing the MDFA module, the mIoU and F1 scores on the Vaihingen dataset decreased by 0.44% and 0.27%, respectively. Similarly, on the Potsdam dataset, removing MDFA led to an mIoU drop of 0.52% and an F1 drop of 0.29%.
Figure 11 shows that after adding the MCFA module to the baseline model, the recognition of low vegetation and vehicles improved significantly. As seen in Figure 11b, the module enhances the shape recognition of vegetation and vehicles. Figure 10c illustrates that after adding MSA to an already EFM-equipped model, the segmentation response to low vegetation and building edges becomes more pronounced. Figure 10d indicates that the MDFA module effectively integrates global and local information, enabling more precise recognition of objects of varying sizes in the image.

4.5.6. Effectiveness of the MSA

Table 9 compares several attention variants on the same baseline. Channel-focused modules ECA-CA, SE and CA substantially increase F1 scores for large, homogeneous classes such as Impervious Surface and Building and produce higher overall accuracy. These channel attentions obtain Impervious Surface F1 around 96.5% and Building F1 around 95.5% with mean F1 close to 89.8% and mIoU near 82.0%. Our MSA achieves the best overall balance with mean F1 of 90.00% and mIoU of 82.28% by providing more even improvements across classes. MSA improves responses for Lowveg and Tree compared with the baseline and also raises Car F1 relative to the baseline. The results suggest that channel attentions are effective at strengthening global class responses for dominant categories while MSA, by integrating multi-scale spatial context with channel interactions, better enhances boundary and small object discrimination and yields the largest overall gains in mean F1 and mIoU.

4.5.7. Effectiveness of Pixel-Wise MDFA Fusion

The results in Table 10 demonstrate that the proposed MDFA-pixel add method significantly outperforms the conventional MDFA concatenation, achieving mIoU of 85.92% and an F1 of 92.28%, compared to 75.75% and 85.90% for MDFA-concat. This improvement can be attributed to the pixel-wise addition, which integrates horizontal and vertical directional features at each spatial location, ensuring that the enhanced features in corresponding directions are properly aligned. In contrast, direct channel concatenation merges features with different characteristics, which may be misaligned and introduce redundancy, reducing the effectiveness of feature fusion. Pixel-wise addition preserves object boundaries, maintains fine-grained structural details, and enhances interactions between global and local features, thereby improving segmentation accuracy across multiple object categories.

4.5.8. Different Backbone Analysis

To eliminate the influence of the backbone, we conducted replacement experiments on the Vaihingen dataset using ConvNeXt-Tiny, ResNet18, ResNeXt50, and ConvNeXt-Small, with the results reported in Table 11. Although ConvNeXt-Small has more parameters and the re-encoder of EFMANet grows with backbone channel width, ConvNeXt-Small still keeps relatively low channel counts in the re-encoder, balancing efficiency and performance. It achieves the best mIoU and F1, indicating stronger feature representation and better adaptation to EFMANet. Therefore, we adopt ConvNeXt-Small as the backbone for EFMANet as an optimal trade-off between computational cost and final performance.

4.5.9. Complexity and Efficiency Analysis of Single Branch

The results in Table 12 illustrate the performance and computational trade-offs between EFMANet with and without the edge branch. The dual-branch EFMANet achieves higher mIoU of 85.92% and F1 score of 92.28% compared to the network without the edge branch, benefiting from the fusion of hierarchical image features and fine-grained edge information through the EFM module. Although this design slightly increases parameters to 63.90 million and FLOPs to 74.40 G, the gain in segmentation accuracy and edge preservation justifies the additional computational cost, demonstrating the effectiveness of incorporating edge features for remote sensing segmentation.

4.5.10. Complexity and Efficiency Analysis

Regarding model parameters and computational complexity, we conducted a comparison between our model and state-of-the-art models. As presented in Table 13, the results demonstrate that although our model has a higher parameter count and computational cost, it outperforms other models in terms of accuracy.

4.5.11. Segmentation Under Extreme Conditions

We also conducted experiments under extreme noise and partial occlusion. As shown in Figure 12, in the first row of images we applied noise and blur to the entire image. The resulting edge maps are filled with noise points, showing that the network cannot resist complete noise. Since the noise in the edge maps resembles low vegetation, most regions are predicted as low vegetation. Additionally, areas that should correspond to impervious surfaces without edges are mistakenly identified as background due to the added noise. The second row illustrates the occlusion scenario, where the network successfully detects the occluded regions and accurately identifies the occluded objects without causing misclassification.

5. Conclusions

In this paper, we propose EFMANet, a remote sensing semantic segmentation network that utilizes a dual-branch architecture to integrate edge information and incorporates spatial and channel dependencies. In the first stage of edge fusion, the Edge Fusion Module (EFM) integrates edge information into the encoder structure, differentiating it from other edge-based methods. This integration ensures that edge information is preserved during training, preventing edge loss and blurring. Additionally, the inclusion of object boundary contours provides the model with enhanced recognition capabilities, enabling better object identification through clear delineation of object boundaries. In the second stage, to combine spatial and channel information, we introduce the Multi-Dimensional Collaborative Fusion Attention (MCFA) module. The Multidimensional Spatial Attention (MSA) component fuses spatial and channel features, extracting horizontal and vertical information through pooling. Furthermore, convolutional and linear layers are employed to further refine the extracted features. The information is then reinforced in the corresponding directions of the original features via a broadcasting mechanism. To account for the unique characteristics of remote sensing images, we apply a fusion attention mechanism to combine global and local contextual information. The local features, processed by convolutional MSA, are fused with global representations through attention, enabling effective segmentation of the image from both global and local perspectives. The experiments validate the superiority of the EFMANet architecture and the effectiveness of its constituent modules. This work addresses challenges arising from complex object boundaries and intra- and inter-class variation in remote sensing imagery; future efforts will concentrate on architectural refinement and broader exploitation of edge information.

Author Contributions

Conceptualization, S.C. and Y.C.; methodology, Y.C.; software, Y.C.; validation, Y.C., S.C. and A.D.; resources, S.C.; data curation, Y.C.; writing—original draft preparation, Y.C.; writing—review and editing, Y.C.; visualization, Y.C.; supervision, S.C.; project administration, S.C.; funding acquisition, A.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of Xinjiang Uygur Autonomous Region under Grant 2025D01C50 and National Natural Science Foundation of China under Grant 62562058 and 62441213.

Data Availability Statement

Vaihingen and Potsdam datasets are available at: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/default.aspx (accessed on 1 January 2025). LoveDA dataset is available at: https://github.com/Junjue-Wang/LoveDA (accessed on 15 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, X.; Chen, H.; Zhao, Y.; He, M.; Han, X. Change detection of buildings in remote sensing images using a spatially and contextually aware Siamese network. Expert Syst. Appl. 2025, 276, 127110. [Google Scholar] [CrossRef]
  2. Zhang, Y.; Gao, H.; Zhou, J.; Zhang, C.; Ghamisi, P.; Xu, S.; Li, C.; Zhang, B. A cross-modal feature aggregation and enhancement network for hyperspectral and LiDAR joint classification. Expert Syst. Appl. 2024, 258, 125145. [Google Scholar] [CrossRef]
  3. Guo, H.; Su, X.; Wu, C.; Du, B.; Zhang, L. Building-Road Collaborative Extraction from Remote Sensing Images via Cross-Task and Cross-Scale Interaction. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5617416. [Google Scholar] [CrossRef]
  4. Zhang, F.; Yang, X. Improving land cover classification in an urbanized coastal area by random forests: The role of variable selection. Remote Sens. Environ. 2020, 251, 112105. [Google Scholar] [CrossRef]
  5. Ma, L.; Li, Y.; Li, J.; Junior, J.M.; Gonçalves, W.N.; Chapman, M.A. Boundarynet: Extraction and completion of road boundaries with deep learning using mobile laser scanning point clouds and satellite imagery. IEEE Trans. Intell. Transp. Syst. 2021, 23, 5638–5654. [Google Scholar] [CrossRef]
  6. Pan, D.; Zhang, M.; Zhang, B. A generic FCN-based approach for the road-network extraction from VHR remote sensing images–using OpenStreetMap as benchmarks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2662–2673. [Google Scholar] [CrossRef]
  7. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  8. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  9. Chen, L.C. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
  10. Gao, Y.; Luo, X.; Gao, X.; Yan, W.; Pan, X.; Fu, X. Semantic segmentation of remote sensing images based on multiscale features and global information modeling. Expert Syst. Appl. 2024, 249, 123616. [Google Scholar] [CrossRef]
  11. Sun, H.; He, X.; Li, H.; Kong, J.; Qiao, M.; Cheng, X.; Li, P.; Zhang, J.; Liu, R.; Shang, J. Adaptive sparse lightweight multi-scale hybrid network for remote sensing image semantic segmentation. Expert Syst. Appl. 2025, 280, 127347. [Google Scholar] [CrossRef]
  12. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  13. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  14. Zhou, Y. A serial semantic segmentation model based on encoder-decoder architecture. Knowl.-Based Syst. 2024, 295, 111819. [Google Scholar] [CrossRef]
  15. Chen, J.; Xu, S.; Zheng, Y. BaAFN: A Boundary-Aware Attention Fusion Network for Remote Sensing Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 20914–20928. [Google Scholar] [CrossRef]
  16. Huang, W.; Deng, F.; Liu, H.; Ding, M.; Yao, Q. Multiscale Semantic Segmentation of Remote Sensing Images Based on Edge Optimization. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5616813. [Google Scholar] [CrossRef]
  17. Guan, T.; Chang, S.; Deng, Y.; Xue, F.; Wang, C.; Jia, X. Oriented SAR Ship Detection Based on Edge Deformable Convolution and Point Set Representation. Remote Sens. 2025, 17, 1612. [Google Scholar] [CrossRef]
  18. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation; Springer International Publishing: Cham, Switzerland, 2015. [Google Scholar]
  19. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  20. Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture. In Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 213–228. [Google Scholar]
  21. Audebert, N.; Le Saux, B.; Lefèvre, S. Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS J. Photogramm. Remote Sens. 2018, 140, 20–32. [Google Scholar] [CrossRef]
  22. Ma, X.; Zhang, X.; Pun, M.O. A crossmodal multiscale fusion network for semantic segmentation of remote sensing data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 3463–3474. [Google Scholar] [CrossRef]
  23. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
  24. Liu, Y.; Gao, K.; Wang, H.; Yang, Z.; Wang, P.; Ji, S.; Huang, Y.; Zhu, Z.; Zhao, X. A Transformer-based multi-modal fusion network for semantic segmentation of high-resolution remote sensing imagery. Int. J. Appl. Earth Obs. Geoinf. 2024, 133, 104083. [Google Scholar] [CrossRef]
  25. Ma, X.; Zhang, X.; Pun, M.O.; Liu, M. A multilevel multimodal fusion transformer for remote sensing semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5403215. [Google Scholar] [CrossRef]
  26. Song, X.; Jiao, L.; Li, L.; Liu, F.; Liu, X.; Yang, S.; Hou, B. MGPACNet: A Multiscale Geometric Prior Aware Cross-Modal Network for Images Fusion Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4412815. [Google Scholar] [CrossRef]
  27. Cheng, H.; Wu, H.; Zheng, J.; Qi, K.; Liu, W. A hierarchical self-attention augmented Laplacian pyramid expanding network for change detection in high-resolution remote sensing images. ISPRS J. Photogramm. Remote Sens. 2021, 182, 52–66. [Google Scholar] [CrossRef]
  28. Li, M.; Long, J.; Stein, A.; Wang, X. Using a semantic edge-aware multi-task neural network to delineate agricultural parcels from remote sensing images. ISPRS J. Photogramm. Remote Sens. 2023, 200, 24–40. [Google Scholar] [CrossRef]
  29. Li, X.; Xie, L.; Wang, C.; Miao, J.; Shen, H.; Zhang, L. Boundary-enhanced dual-stream network for semantic segmentation of high-resolution remote sensing images. GISci. Remote Sens. 2024, 61, 2356355. [Google Scholar] [CrossRef]
  30. Long, J.; Liu, S.; Li, M.; Zhao, H.; Jin, Y. BGSNet: A boundary-guided Siamese multitask network for semantic change detection from high-resolution remote sensing images. ISPRS J. Photogramm. Remote Sens. 2025, 225, 221–237. [Google Scholar] [CrossRef]
  31. Li, J.; Wei, Y.; Wei, T.; He, W. A Comprehensive Deep-Learning Framework for Fine-Grained Farmland Mapping from High-Resolution Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5601215. [Google Scholar] [CrossRef]
  32. Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and Multiscale Transformer Fusion Network for Remote-Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2004612. [Google Scholar] [CrossRef]
  33. Zeng, Q.; Zhou, J.; Tao, J.; Chen, L.; Niu, X.; Zhang, Y. Multiscale Global Context Network for Semantic Segmentation of High-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622913. [Google Scholar] [CrossRef]
  34. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  35. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  36. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive bilateral contextual network for efficient semantic segmentation of Fine-Resolution remotely sensed imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98. [Google Scholar] [CrossRef]
  37. Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A novel transformer based semantic segmentation scheme for fine-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6506105. [Google Scholar] [CrossRef]
  38. Li, R.; Wang, L.; Zhang, C.; Duan, C.; Zheng, S. A2-FPN for semantic segmentation of fine-resolution remotely sensed images. Int. J. Remote Sens. 2022, 43, 1131–1155. [Google Scholar] [CrossRef]
  39. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P.M. Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607713. [Google Scholar] [CrossRef]
  40. Zheng, J.; Shao, A.; Yan, Y.; Wu, J.; Zhang, M. Remote sensing semantic segmentation via boundary supervision-aided multiscale channelwise cross attention network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4405814. [Google Scholar] [CrossRef]
  41. Chen, Y.; Wang, Y.; Xiong, S.; Lu, X.; Zhu, X.X.; Mou, L. Integrating detailed features and global contexts for semantic segmentation in ultra-high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4703914. [Google Scholar]
  42. Yang, Y.; Yuan, G.; Li, J. SFFNet: A Wavelet-Based Spatial and Frequency Domain Fusion Network for Remote Sensing Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 3000617. [Google Scholar] [CrossRef]
  43. Mu, J.; Zhou, S.; Sun, X. PPMamba: Enhancing Semantic Segmentation in Remote Sensing Imagery by SS2D. IEEE Geosci. Remote Sens. Lett. 2025, 22, 6001705. [Google Scholar] [CrossRef]
Figure 1. Taking the ISPRS Vaihingen dataset as an example, the existing challenges in remote sensing semantic segmentation are demonstrated. (a) Trees and low shrubs have similar colors, making them difficult to distinguish. (b) The same building A appears differently in the image, while different buildings A and B belong to the same category but exhibit distinct visual characteristics. (c) The object boundaries in the image are complex and intertwined, with shadow occlusion issues also present.
Figure 2. The first stage performs spatial feature extraction. The second stage then performs feature enhancement and fusion, comprising horizontal-vertical feature enhancement and fusion attention: horizontal-vertical feature enhancement extracts image features along different dimensions, while fusion attention integrates information across dimensions to fuse global and local information.
Figure 3. The proposed EFM module is used to fuse the edge branch with the original encoder branch features. It consists of two branches for channel enhancement, and finally, the information from both branches is fused by addition.
Figure 4. The proposed MSA module enhances the input by swapping dimensions to strengthen the corresponding horizontal and vertical features. After pooling, the features are fused through a 1 × 3 convolution and a linear layer, ultimately producing the enhanced output; N denotes the corresponding enhanced dimension.
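A minimal sketch of the pooling-and-broadcast mechanism described in this caption is given below. It assumes square feature maps and a coordinate-attention-style gating; the class name, layer sizes, and the sigmoid gate are illustrative assumptions rather than the exact MSA configuration.

```python
import torch
import torch.nn as nn

class DirectionalEnhanceSketch(nn.Module):
    """Pool along one axis, refine with a 1-D (1x3) convolution and a linear layer,
    then broadcast the gated descriptor back over the original feature map."""
    def __init__(self, channels: int, size: int):   # size = H = W (square maps assumed)
        super().__init__()
        self.conv_w, self.fc_w = nn.Conv1d(channels, channels, 3, padding=1), nn.Linear(size, size)
        self.conv_h, self.fc_h = nn.Conv1d(channels, channels, 3, padding=1), nn.Linear(size, size)
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        horiz = self.gate(self.fc_w(self.conv_w(x.mean(dim=2))))   # (B, C, W) descriptor
        x = x * horiz.unsqueeze(2)                                  # broadcast along H
        vert = self.gate(self.fc_h(self.conv_h(x.mean(dim=3))))    # (B, C, H) descriptor
        return x * vert.unsqueeze(3)                                # broadcast along W

# Example: DirectionalEnhanceSketch(64, 32)(torch.randn(2, 64, 32, 32))
```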
Figure 5. The proposed MDFA module integrates the fused query with the edge-enhanced vertical and horizontal key-value pairs, effectively capturing boundary-aware spatial dependencies and enhancing feature representation in remote sensing segmentation.
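For illustration only, the cross-attention described in this caption could be realized as below, treating the horizontally and vertically pooled descriptors as key-value tokens attended by a fused query; the token layout and single-head attention are simplifying assumptions, not the published MDFA implementation (requires PyTorch 2.x for scaled_dot_product_attention).

```python
import torch
import torch.nn.functional as F

def mdfa_cross_attention_sketch(query_feat: torch.Tensor,
                                horiz_kv: torch.Tensor,
                                vert_kv: torch.Tensor) -> torch.Tensor:
    """query_feat: (B, C, H, W); horiz_kv: (B, C, W); vert_kv: (B, C, H).

    Every spatial query position attends to the W + H directional key-value tokens.
    """
    b, c, h, w = query_feat.shape
    q = query_feat.flatten(2).transpose(1, 2)                   # (B, H*W, C) query tokens
    kv = torch.cat([horiz_kv, vert_kv], dim=2).transpose(1, 2)  # (B, W+H, C) directional tokens
    out = F.scaled_dot_product_attention(q, kv, kv)             # single-head attention for brevity
    return out.transpose(1, 2).reshape(b, c, h, w)
```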
Figure 6. The visualized comparative results on the Vaihingen dataset. The red square indicates the area that requires attention for comparison.
Figure 7. The visualized comparative results on the Potsdam dataset. The red square indicates the area that requires attention for comparison.
Figure 8. The visualized comparative results on the LoveDA dataset. The blue square indicates the area that requires attention for comparison.
Figure 9. Enlarged segmentation results before and after adding EFM to the Baseline. (a) Baseline; (b) Baseline + EFM. These results demonstrate that incorporating the EFM module significantly enhances the accuracy of object recognition, particularly in challenging regions such as complex object boundaries and small objects affected by shadows or occlusions.
Figure 10. Heatmaps of each stage module. (a) Features output by the Baseline after passing through the encoder. (b) Fusion features of the encoder after incorporating the EFM module. (c) Output features after integrating the MSA module. (d) Features output after applying the MDFA module. The results suggest that the introduction of edge information enables the model to better focus on fine-scale objects, complex object boundaries, and targets in shadowed regions. In addition, the incorporation of spatial attention guides the model to capture the overall structural layout more effectively.
Figure 11. Enlarged segmentation results before and after adding MCFA to the Baseline. (a) Baseline; (b) Baseline + MCFA. The results indicate that the MCFA module, which integrates both global and local contextual information, leads to more accurate recognition of both large-scale and small-scale objects.
Figure 12. The image shows the prediction results under conditions of extreme noise and partial occlusion.
Table 1. Comparative experiments on Vaihingen Dataset.
| Method | Imp. Surf. F1 (%) | Building F1 (%) | Low Veg. F1 (%) | Tree F1 (%) | Car F1 (%) | Mean F1 (%) | OA (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|
| PSPNet [34] | 95.67 | 93.22 | 83.11 | 88.78 | 77.41 | 87.64 | 91.66 | 78.62 |
| DeeplabV3+ [35] | 96.58 | 95.27 | 84.15 | 89.76 | 85.19 | 90.19 | 92.96 | 82.53 |
| ABCNet [36] | 96.45 | 94.81 | 84.16 | 90.19 | 83.97 | 89.92 | 92.79 | 82.08 |
| DC-Swin [37] | 96.96 | 96.09 | 84.85 | 90.24 | 86.83 | 90.99 | 93.51 | 83.84 |
| A²-FPN [38] | 96.78 | 95.56 | 84.63 | 90.08 | 88.47 | 91.10 | 93.26 | 83.99 |
| MANet [39] | 96.74 | 95.47 | 84.61 | 89.73 | 88.75 | 91.06 | 93.07 | 83.89 |
| CMTFNet [32] | 96.99 | 95.98 | 85.31 | 90.40 | 87.21 | 91.18 | 93.61 | 84.12 |
| MCCANet [40] | 96.31 | 94.41 | 84.20 | 89.96 | 83.93 | 89.76 | 92.68 | 81.81 |
| MSGCNet [33] | 96.86 | 95.86 | 85.45 | 90.57 | 89.63 | 91.67 | 93.60 | 84.90 |
| MCSNet [41] | 97.00 | 95.54 | 84.41 | 90.37 | 88.21 | 91.19 | 93.45 | 84.12 |
| SFFNet [42] | 97.06 | 95.75 | 85.06 | 90.14 | 90.33 | 91.67 | 93.51 | 84.91 |
| PPMamba [43] | 97.00 | 96.08 | 85.54 | 90.48 | 88.77 | 91.57 | 93.68 | 84.76 |
| MSEONet [16] | 96.78 | 95.75 | 84.48 | 90.16 | 88.22 | 91.08 | 93.26 | 83.95 |
| DBBANet [31] | 96.77 | 95.40 | 84.96 | 90.32 | 89.12 | 91.30 | 93.29 | 84.31 |
| EFMANet | 97.20 | 96.13 | 86.21 | 91.06 | 90.79 | 92.28 | 94.00 | 85.92 |
The best results are indicated in bold, and the second-best results are underlined.
Table 2. Comparative experiments on Potsdam Dataset.
| Method | Imp. Surf. F1 (%) | Building F1 (%) | Low Veg. F1 (%) | Tree F1 (%) | Car F1 (%) | Mean F1 (%) | OA (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|
| PSPNet [34] | 91.31 | 92.70 | 83.65 | 82.41 | 91.82 | 88.38 | 87.30 | 79.45 |
| DeeplabV3+ [35] | 93.85 | 96.20 | 87.20 | 88.79 | 95.83 | 92.37 | 91.16 | 86.04 |
| ABCNet [36] | 93.36 | 96.47 | 86.76 | 88.64 | 95.98 | 92.24 | 90.71 | 85.84 |
| DC-Swin [37] | 94.33 | 96.94 | 87.97 | 89.38 | 86.31 | 92.99 | 91.78 | 87.11 |
| A²-FPN [38] | 93.97 | 96.31 | 87.49 | 88.61 | 96.22 | 92.53 | 91.32 | 86.33 |
| MANet [39] | 93.78 | 96.44 | 87.63 | 89.16 | 96.41 | 92.69 | 91.42 | 86.59 |
| CMTFNet [32] | 94.58 | 97.01 | 88.24 | 89.22 | 96.24 | 93.06 | 91.96 | 87.25 |
| MCCANet [40] | 93.54 | 96.10 | 87.00 | 88.17 | 95.33 | 92.03 | 90.72 | 85.45 |
| MSGCNet [33] | 94.43 | 96.92 | 87.75 | 89.10 | 96.59 | 92.96 | 91.80 | 87.08 |
| MCSNet [41] | 94.11 | 96.64 | 87.72 | 88.82 | 95.79 | 92.62 | 91.44 | 86.46 |
| SFFNet [42] | 94.62 | 96.90 | 88.24 | 89.22 | 96.93 | 93.18 | 91.97 | 87.47 |
| PPMamba [43] | 94.54 | 96.95 | 88.29 | 89.48 | 96.47 | 93.15 | 92.07 | 87.38 |
| MSEONet [16] | 94.32 | 97.05 | 87.77 | 89.00 | 96.22 | 92.87 | 91.56 | 86.93 |
| DBBANet [31] | 94.24 | 96.55 | 87.68 | 88.92 | 96.46 | 92.77 | 91.49 | 86.74 |
| EFMANet (ours) | 94.96 | 97.34 | 88.55 | 89.86 | 96.87 | 93.52 | 92.53 | 88.03 |
The best results are indicated in bold, and the second-best results are underlined.
Table 3. Comparative experiments on LoveDA Dataset.
| Method | Background | Building | Road | Water | Barren | Forest | Agriculture | mIoU |
|---|---|---|---|---|---|---|---|---|
| PSPNet [34] | 40.98 | 47.40 | 45.92 | 75.23 | 10.69 | 40.44 | 56.30 | 45.28 |
| DeeplabV3+ [35] | 53.41 | 58.61 | 54.61 | 69.28 | 28.55 | 43.11 | 52.07 | 51.27 |
| ABCNet [36] | 41.15 | 55.01 | 49.71 | 77.61 | 15.26 | 44.41 | 54.19 | 48.19 |
| DC-Swin [37] | 53.47 | 61.43 | 55.47 | 70.27 | 32.45 | 44.06 | 58.29 | 52.86 |
| A²-FPN [38] | 43.46 | 57.02 | 52.63 | 76.53 | 17.85 | 45.06 | 56.63 | 49.88 |
| MANet [39] | 46.63 | 54.18 | 53.56 | 83.81 | 15.07 | 45.00 | 51.51 | 49.97 |
| CMTFNet [32] | 45.47 | 56.85 | 60.20 | 78.43 | 18.67 | 47.31 | 57.34 | 52.04 |
| MCCANet [40] | 40.80 | 52.92 | 52.98 | 77.09 | 16.81 | 41.32 | 54.72 | 48.90 |
| MSGCNet [33] | 52.54 | 57.71 | 57.93 | 80.03 | 16.44 | 46.23 | 59.07 | 51.42 |
| MCSNet [41] | 45.14 | 61.77 | 58.12 | 82.10 | 17.38 | 47.15 | 61.36 | 53.29 |
| SFFNet [42] | 54.37 | 64.22 | 56.56 | 68.67 | 32.95 | 44.66 | 52.02 | 53.35 |
| PPMamba [43] | 54.15 | 63.28 | 55.25 | 68.36 | 34.96 | 40.86 | 51.30 | 52.59 |
| MSEONet [16] | 45.22 | 55.22 | 53.72 | 78.45 | 15.65 | 46.50 | 59.84 | 50.66 |
| DBBANet [31] | 45.21 | 55.40 | 53.04 | 77.81 | 15.17 | 45.26 | 57.75 | 49.95 |
| EFMANet (ours) | 47.25 | 60.89 | 58.43 | 81.66 | 18.36 | 49.25 | 56.48 | 54.48 |
The best results are indicated in bold, and the second-best results are underlined.
Table 4. McNemar Test Results on the Vaihingen and Potsdam Datasets for Recent Semantic Segmentation Networks.
| Network | Vaihingen B (×10⁶) | Vaihingen C (×10⁶) | Vaihingen χ² (×10⁶) | Vaihingen p | Potsdam B (×10⁶) | Potsdam C (×10⁶) | Potsdam χ² (×10⁶) | Potsdam p |
|---|---|---|---|---|---|---|---|---|
| PSPNet [34] | 4.73 | 2.468 | 0.711 | <0.001 | 40.837 | 13.515 | 13.735 | <0.001 |
| DeeplabV3+ [35] | 2.691 | 2.112 | 0.070 | <0.001 | 19.720 | 10.976 | 2.491 | <0.001 |
| ABCNet [36] | 3.707 | 2.257 | 0.352 | <0.001 | 19.822 | 10.959 | 2.552 | <0.001 |
| DC-Swin [37] | 2.376 | 2.071 | 0.021 | <0.001 | 12.545 | 10.799 | 0.131 | <0.001 |
| A²-FPN [38] | 2.777 | 1.968 | 0.138 | <0.001 | 16.621 | 11.186 | 1.062 | <0.001 |
| MANet [39] | 3.534 | 2.495 | 0.179 | <0.001 | 15.007 | 10.565 | 0.772 | <0.001 |
| CMTFNet [32] | 2.331 | 1.916 | 0.041 | <0.001 | 13.210 | 11.504 | 0.118 | <0.001 |
| MCCANet [40] | 3.655 | 2.187 | 0.369 | <0.001 | 19.281 | 10.770 | 2.411 | <0.001 |
| MSGCNet [33] | 2.145 | 1.983 | 0.006 | <0.001 | 13.704 | 11.141 | 0.264 | <0.001 |
| MCSNet [41] | 2.639 | 2.057 | 0.072 | <0.001 | 15.935 | 11.355 | 0.769 | <0.001 |
| SFFNet [42] | 2.140 | 1.721 | 0.045 | <0.001 | 11.627 | 10.313 | 0.079 | <0.001 |
| PPMamba [43] | 2.100 | 1.871 | 0.013 | <0.001 | 12.159 | 11.185 | 0.041 | <0.001 |
| MSEONet [16] | 2.619 | 1.821 | 0.143 | <0.001 | 13.555 | 9.700 | 0.639 | <0.001 |
| DBBANet [31] | 2.506 | 1.812 | 0.111 | <0.001 | 13.860 | 10.010 | 0.621 | <0.001 |
| EFMANet (ours) | 1.987 | 1.654 | 0.008 | <0.001 | 10.524 | 9.412 | 0.036 | <0.001 |
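As a consistency check on Table 4, the values follow the standard McNemar statistic without continuity correction, assuming B and C denote the counts of test pixels misclassified by only one of the two compared models (the usual reading of such a test). For PSPNet on Vaihingen, for example, χ² = (B − C)² / (B + C) = (4.73 × 10⁶ − 2.468 × 10⁶)² / (4.73 × 10⁶ + 2.468 × 10⁶) ≈ 0.711 × 10⁶, matching the tabulated value, with p < 0.001.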
Table 5. Ablation Experiments on the Vaihingen and Potsdam Datasets.
| Method | Vaihingen mIoU (%) | Vaihingen F1 (%) | Potsdam mIoU (%) | Potsdam F1 (%) |
|---|---|---|---|---|
| EFMANet | 85.92 | 92.28 | 88.03 | 93.52 |
| EFMANet w/o EFM | 85.15 | 91.81 | 87.04 | 92.95 |
| EFMANet w/o MCFA | 85.04 | 91.73 | 87.11 | 92.99 |
| EFMANet w/o MCFA-H | 85.15 | 91.80 | 87.25 | 93.07 |
| EFMANet w/o MCFA-W | 85.25 | 91.86 | 87.16 | 93.03 |
| EFMANet w/o MDFA | 85.48 | 92.01 | 87.51 | 93.23 |
Table 6. Results of adding individual modules to the baseline model on Vaihingen Dataset.
| Method | Imp. Surf. F1 (%) | Building F1 (%) | Low Veg. F1 (%) | Tree F1 (%) | Car F1 (%) | Mean F1 (%) | OA (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|
| Baseline | 96.52 | 95.39 | 84.14 | 90.06 | 81.63 | 89.55 | 93.00 | 81.59 |
| Baseline + EFM | 96.69 | 95.80 | 84.98 | 90.29 | 83.18 | 90.19 | 93.32 | 82.58 |
| Baseline + MCFA | 96.64 | 95.57 | 84.70 | 90.09 | 82.99 | 90.00 | 93.16 | 82.28 |
| Baseline + MCFA-H | 96.56 | 95.47 | 84.72 | 90.22 | 81.89 | 89.77 | 93.14 | 81.94 |
| Baseline + MCFA-W | 96.59 | 95.52 | 84.71 | 90.17 | 82.12 | 89.82 | 93.15 | 82.02 |
Table 7. Comparison of different edge detection methods on segmentation performance.
| Method | mIoU (%) | OA (%) | F1 (%) |
|---|---|---|---|
| Laplacian | 85.43 | 93.78 | 91.92 |
| Canny | 85.85 | 93.99 | 92.20 |
| Sobel (ours) | 85.92 | 94.00 | 92.28 |
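The three edge extractors compared in Table 7 correspond to standard operators; a minimal OpenCV sketch is shown below, where the Canny thresholds and kernel sizes are illustrative choices rather than the settings used in our experiments.

```python
import cv2
import numpy as np

def edge_maps(gray: np.ndarray) -> dict:
    """Compute Laplacian, Canny, and Sobel-magnitude edge maps for a grayscale uint8 image.

    Thresholds and kernel sizes are illustrative, not the paper's exact settings.
    """
    lap = cv2.convertScaleAbs(cv2.Laplacian(gray, cv2.CV_64F, ksize=3))
    canny = cv2.Canny(gray, 100, 200)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    sobel = cv2.convertScaleAbs(np.sqrt(gx ** 2 + gy ** 2))
    return {"laplacian": lap, "canny": canny, "sobel": sobel}
```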
Table 8. Results of adding EFM to the baseline model on Vaihingen Dataset. × indicates that the EFM of that layer is not used, and ✓ indicates that it is used.
| EFM1 | EFM2 | EFM3 | EFM4 | Imp. Surf. F1 (%) | Building F1 (%) | Low Veg. F1 (%) | Tree F1 (%) | Car F1 (%) | Mean F1 (%) | OA (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| × | × | × | × | 96.52 | 95.39 | 84.14 | 90.06 | 81.63 | 89.55 | 93.00 | 81.59 |
| × | × | × | ✓ | 96.73 | 95.79 | 83.55 | 90.00 | 86.12 | 90.43 | 93.13 | 82.96 |
| × | × | ✓ | ✓ | 97.03 | 95.74 | 83.49 | 89.92 | 89.07 | 91.05 | 93.24 | 83.94 |
| × | ✓ | ✓ | ✓ | 97.09 | 95.96 | 83.56 | 90.02 | 90.05 | 91.34 | 93.35 | 84.41 |
| ✓ | ✓ | ✓ | ✓ | 97.12 | 96.06 | 85.59 | 90.72 | 89.29 | 91.74 | 93.78 | 85.04 |
Table 9. Comparison of different Attention modules on Vaihingen Dataset.
| Method | Imp. Surf. F1 (%) | Building F1 (%) | Low Veg. F1 (%) | Tree F1 (%) | Car F1 (%) | Mean F1 (%) | OA (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|
| Baseline | 90.12 | 92.45 | 82.36 | 87.29 | 76.14 | 89.55 | 89.32 | 81.59 |
| Baseline + ECA-CA | 96.52 | 95.44 | 84.51 | 89.99 | 82.71 | 89.83 | 93.04 | 82.01 |
| Baseline + SE | 96.62 | 95.56 | 84.83 | 89.83 | 82.47 | 89.86 | 93.09 | 82.07 |
| Baseline + CA | 96.64 | 95.56 | 84.72 | 89.99 | 82.49 | 89.88 | 93.14 | 82.10 |
| Baseline + MSA (ours) | 90.73 | 92.89 | 83.19 | 87.88 | 78.05 | 90.00 | 89.81 | 82.28 |
Table 10. Comparison of MDFA concatenation and pixel-wise addition for feature fusion on Potsdam dataset.
| Method | mIoU (%) | OA (%) | F1 (%) | Imp. Surf. | Building | Low Veg. | Tree | Car |
|---|---|---|---|---|---|---|---|---|
| MDFA-concat | 75.75 | 89.41 | 85.90 | 94.58 | 88.89 | 78.66 | 87.53 | 79.78 |
| MDFA-pixel add (ours) | 85.92 | 94.00 | 92.28 | 97.20 | 96.13 | 86.21 | 91.06 | 90.79 |
Table 11. Comparison of parameters and mIoU of different backbones on Vaihingen Dataset.
| Backbone | Params (M) | mIoU (%) |
|---|---|---|
| ResNet18 | 15.6 | 84.65 |
| ResNeXt50 | 104.0 | 84.44 |
| ConvNeXt-Tiny | 39.9 | 85.34 |
| ConvNeXt-Small | 63.9 | 85.92 |
Table 12. Comparison of Single-Branch Network and EFMANet in terms of segmentation performance and computational cost.
| Method | mIoU (%) | F1 (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|
| Single-Branch | 85.15 | 91.81 | 58.20 | 50.18 |
| EFMANet (Dual-Branch) | 85.92 | 92.28 | 63.90 | 74.40 |
Table 13. Comparison of parameters and computation of different networks.
| Method | Params (M) | FLOPs (G) |
|---|---|---|
| PSPNet [34] | 53.3 | 28.1 |
| DeeplabV3+ [35] | 40.3 | 138.6 |
| ABCNet [36] | 13.6 | 15.7 |
| DC-Swin [37] | 66.9 | 72.1 |
| A²-FPN [38] | 22.8 | 41.7 |
| MANet [39] | 35.9 | 54.2 |
| CMTFNet [32] | 30.1 | 34.3 |
| MCCANet [40] | 43.0 | 65.1 |
| MSGCNet [33] | 27.6 | 29.1 |
| MCSNet [41] | 54.7 | 63.9 |
| SFFNet [42] | 34.1 | 52.0 |
| PPMamba [43] | 21.7 | 23.1 |
| MSEONet [16] | 47.19 | 45.25 |
| DBBANet [31] | 36.00 | 62.78 |
| EFMANet (ours) | 63.9 | 74.4 |
The best Params and FLOPs are indicated in bold.
