Article

MSA-UNet: Multiscale Feature Aggregation with Attentive Skip Connections for Precise Building Extraction

1 School of Surveying and Geo-Informatics, Shandong Jianzhu University, Jinan 250101, China
2 School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430070, China
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2025, 14(12), 497; https://doi.org/10.3390/ijgi14120497
Submission received: 19 October 2025 / Revised: 12 December 2025 / Accepted: 15 December 2025 / Published: 17 December 2025
(This article belongs to the Special Issue Spatial Data Science and Knowledge Discovery)

Abstract

Accurate and reliable extraction of building structures from high-resolution (HR) remote sensing images is an important research topic in 3D cartography and smart city construction. However, despite the strong overall performance of recent deep learning models, limitations remain in handling significant variations in building scales and complex architectural forms, which may lead to inaccurate boundaries or difficulties in extracting small or irregular structures. Therefore, the present study proposes MSA-UNet, a reliable semantic segmentation framework that leverages multiscale feature aggregation and attentive skip connections for accurate extraction of building footprints. The framework is built on the U-Net architecture and replaces the original encoder with VGG16, which enhances its ability to capture low-discriminative features. To further improve the representation of buildings with different scales and shapes, a serial coarse-to-fine feature aggregation mechanism was used. Additionally, a novel skip connection with adaptive weighting was built between the encoder and decoder layers. Furthermore, a dual-attention mechanism, implemented through the convolutional block attention module, was integrated to enhance the focus of the network on building regions. Extensive experiments conducted on the WHU and Inria building datasets validated the effectiveness of MSA-UNet. On the WHU dataset, the model achieved state-of-the-art performance with a mean Intersection over Union (mIoU) of 94.26%, an accuracy of 98.32%, an F1-score of 96.57%, and a mean Pixel accuracy (mPA) of 96.85%, corresponding to a gain of 1.41% in mIoU over the baseline U-Net. On the more challenging Inria dataset, MSA-UNet achieved an mIoU of 85.92%, a consistent improvement of up to 1.9% over the baseline U-Net. These results confirm that MSA-UNet markedly improves the accuracy and boundary integrity of building extraction from HR data, outperforming existing classic models in terms of segmentation quality and robustness.

1. Introduction

Driven by the rapid evolution of remote sensing platforms, including spaceborne, airborne, and terrestrial systems, accurate building footprint extraction from images has emerged as an interdisciplinary research topic at the intersection of digital photogrammetry and computer vision. However, the inherent diversity in building types, a wide range of spatial scales, and the complexity of surrounding backgrounds in high-resolution (HR) images still pose difficulties for achieving fine-grained and boundary-accurate building extraction [1].
Conventional approaches to building extraction from HR images largely depend on low-level features, such as contrast [2], texture [3], intensity [4], and topological structure [5]. Typically, these methods apply pattern recognition algorithms under predefined threshold conditions to derive the extraction outcomes [6]. However, traditional methods rely heavily on empirical thresholds and manually defined models, which limits their ability to comprehensively and hierarchically represent building characteristics. Consequently, they often fall short of delivering fully automatic and reliable extraction of diverse building types from imagery.
In recent years, deep learning algorithms have been widely used for semantic segmentation of remote sensing imagery owing to their exceptional capacity for hierarchical feature extraction. Specifically, convolutional neural networks are capable of automatically identifying and classifying high-level semantic and low-level spatial features, thereby achieving fine-grained and pixel-level classification. At present, the encoder–decoder framework has become the mainstream architecture for semantic segmentation. In this structure, the encoder distills and compresses the input image into a low-resolution, high-level semantic representation, while the decoder performs feature upsampling and reconstructs spatial details. Shelhamer et al. [7] introduced the fully convolutional network (FCN), a pioneering architecture that uses skip connections to effectively combine high-level semantic features with low-level boundary cues, marking a milestone in end-to-end semantic segmentation. Since then, various representative architectures have been proposed to enhance semantic segmentation performance. Classic CNN-based models, such as U-Net [8], PSPNet [9], and the DeepLab series [10,11,12], have been widely applied. More recently, Transformer-based architectures, including Swin-UNet [13] and SegFormer [14], and their variants [15,16], have also gained prominence. These deep learning networks can automatically capture complex spectral–spatial feature patterns in remote sensing imagery through end-to-end training, thereby enabling intelligent extraction of geospatial objects, such as buildings and roads [17]. The U-Net enhances the design of FCN by using a fully symmetric encoder–decoder structure, where the contracting path captures multiscale contextual information, while the expansive path enables precise spatial localization.
Nevertheless, existing encoder–decoder networks, such as FCN and U-Net, may still struggle to handle pronounced variations in building scale and density inherent in remote sensing imagery. To address this limitation, Oktay et al. [18] incorporated attention gates into the U-Net framework, allowing the network to implicitly filter out irrelevant regions and enhance salient features that can be helpful for object segmentation. Ye et al. [19] observed semantic inconsistencies in feature extraction at different stages and thus introduced a reweighting strategy prior to feature fusion to effectively eliminate such a disadvantage. Lin et al. [20] introduced prior knowledge of building edges and a multiobjective loss function. This approach markedly improves edge extraction accuracy for densely distributed, small-scale buildings in complex urban scenes. Xue et al. [21] proposed a comprehensive approach for rural building extraction from UAV imagery, incorporating a dilated convolution pyramid pooling module (PPM) to enhance multiscale feature capture. A multiscale fusion module, in conjunction with a coordinate attention mechanism, was used to optimize fine details and global dependencies. Guo et al. [22] innovatively integrated dilated and dense convolutions within the shallow layers of the network and introduced a multiscale fusion module to address the inherent trade-off between expanding the receptive field and preserving fine-grained details. Song et al. [23] proposed the DHI-Net model, which introduces a detail-preserving and hierarchical interaction module to optimize the feature extraction and fusion process.
In summary, FCN models, represented by the U-Net, use a single-path prediction architecture, which limits their ability to capture features across multiple spatial scales. Due to the repeated down-sampling operations in the encoder, fine-scale structures such as small buildings and thin boundaries tend to become blurred or partially lost. Furthermore, the existing skip connections and feature fusion mechanisms may not fully recover these details, as they can overlook the contextual relevance among multi-scale features. Consequently, in many cases, the loss of fine detail is more likely to be related to suboptimal feature fusion after encoder down-sampling than to the encoding process alone.
To mitigate the loss of fine details that results from suboptimal fusion of multi-scale features after encoder down-sampling, and to address the underutilization of effective building features during decoding, the present study proposes MSA-UNet, a U-Net-based building extraction network. By incorporating enhanced multiscale context modeling and more effective feature fusion within the U-Net framework, the proposed network is designed to better address the difficulties posed by buildings of varying scales, shapes, and surrounding environments in HR remote sensing imagery. Experimental results on the WHU and Inria aerial datasets indicated that MSA-UNet outperformed several state-of-the-art models, including U-Net [8], PSPNet [9], DeepLabV3+ [12], Attention U-Net [18], STTNet [15] and DSATNet [16], in terms of segmentation accuracy and boundary refinement. The main contributions of this article are given as follows.
(1) We design a multiscale feature aggregation (MSFA) module that jointly captures local and nonlocal information and aggregates multiscale context, thereby enhancing high-level feature representation.
(2) We integrate a convolutional block attention module (CBAM) into the decoder to adaptively reweight spatial locations and channels, thereby suppressing noise from irrelevant background regions and improving the response to building regions.
(3) We design an attentive skip connection (ASC) to dynamically emphasize salient features by computing attention-based correlations between encoder and decoder features, which strengthens multi-level feature fusion and refines building boundaries.

2. Methodology

Based on the U-Net framework, the present study proposes MSA-UNet, with its overall architecture depicted in Figure 1. This network mainly comprises three stages, namely, encoder, feature fusion, and decoder. The encoder adopts VGG16 [24] as the backbone feature extractor, generating five initial effective feature maps at different spatial scales. This choice is primarily driven by both performance and practicality. As a widely adopted encoder in U-Net–based building extraction networks for HR remote sensing imagery, VGG16 offers a favorable balance between representational capacity and computational efficiency, while its well-established ImageNet pretrained weights facilitate stable optimization and enhance generalization on the WHU and Inria building datasets. Immediately following the encoder, the MSFA module combines a PPM [9] and a self-modulation feature aggregation (SMFA) module [25] to fuse deep features across multireceptive fields. The decoder mirrors the encoder structure, where upsampling is performed using bilinear interpolation instead of transposed convolutions to recover the original image resolution. At the end of each decoder stage, a dual-attention module is embedded to optimize long- and short-range dependencies, enhancing the ability of the network to focus on critical information while suppressing irrelevant features. To further improve feature interaction and noise suppression, ASCs are used to adaptively fuse deep and shallow features across different levels. This mechanism effectively mitigates information loss and enhances the segmentation of fine-grained structures. A detailed comparison between MSA-UNet and the original U-Net is presented in Table 1.
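For concreteness, the sketch below shows one way the five encoder feature maps could be obtained from a torchvision VGG16 backbone. The stage boundaries (each stage ending just before a max-pooling layer) and the loading of pretrained weights are assumptions based on the common VGG16/U-Net pairing rather than the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGG16Encoder(nn.Module):
    """Splits a VGG16 feature extractor into five stages, returning one feature map
    per spatial scale, as assumed for the MSA-UNet encoder."""
    def __init__(self):
        super().__init__()
        features = vgg16().features  # in practice, ImageNet-pretrained weights would be loaded, as in the paper
        # Assumed stage boundaries: each stage ends just before a max-pooling layer.
        self.stage1 = features[:4]     # 64 channels,  H x W
        self.stage2 = features[4:9]    # 128 channels, H/2 x W/2
        self.stage3 = features[9:16]   # 256 channels, H/4 x W/4
        self.stage4 = features[16:23]  # 512 channels, H/8 x W/8
        self.stage5 = features[23:30]  # 512 channels, H/16 x W/16

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        f5 = self.stage5(f4)
        return f1, f2, f3, f4, f5

if __name__ == "__main__":
    encoder = VGG16Encoder()
    feats = encoder(torch.randn(1, 3, 512, 512))
    print([f.shape for f in feats])
```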

2.1. MSFA Module

The HR imagery contains rich semantic information, and buildings within such images often demonstrate considerable variations in scale. To better accommodate these differences and achieve automatic extraction of building footprints with high precision, we designed an MSFA module to obtain deep high-level features at multiple scales. The MSFA module is composed of two serially connected components: the PPM and SMFA modules. Although recent studies have explored the integration of multiscale features with attention mechanisms, their architectural designs differ from ours in several key aspects. For example, classic models like DeepLabV3+ [12] and PSPNet [9] utilize parallel branching structures to capture multiscale context simultaneously. Similarly, DSAT-Net [16] and BEARNet [20] rely on dual-stream or parallel architectures to separately capture global context and edge details before fusion. These parallel approaches have shown promising results. However, they generally treat global context modeling and local detail recovery as separate or concurrent tasks. In contrast, our MSFA module employs a strictly serial, coarse-to-fine pipeline. Rather than treating PPM as the final global context representation, we consider it a coarse initial approximation. This is immediately followed by the SMFA module, which splits the features along the channel dimension into two parallel branches: the efficient approximation of self-attention (EASA) branch, designed to explicitly capture nonlocal contextual dependencies, and the local detail estimation (LDE) branch, which focuses on recovering fine-grained local details. This cascaded design allows the coarse global context from PPM to be progressively refined and sharpened within the SMFA module before being passed to the decoder. The detailed structure of the MSFA is illustrated in Figure 2.
Effectively leveraging global contextual information considerably enhances the understanding of complex scenes by the network. In deep convolutional networks, the size of the receptive field reflects the extent to which contextual information is used. A larger receptive field allows the network to integrate information from a wider spatial context, facilitating better feature abstraction and a more comprehensive understanding of the input data. The PPM is used to expand the receptive field and incorporate multilevel semantic cues. In the PPM, the feature map obtained from the backbone network undergoes average pooling with varying strides to generate intermediate feature representations across multiple scales. Specifically, a pooling layer with a stride of six is used to capture global contextual information, whereas three additional branches apply pooling operations with strides of 1, 2, and 3, respectively, to extract semantic information at different granularities. The pooling strides were set to {1, 2, 3, 6}, strictly following the standard configuration of the classic PPM in PSPNet [9]. This set of strides spans multiple spatial scales, striking a balance between capturing fine-grained local details and large-scale global context, thereby enabling the network to effectively integrate semantic cues at different granularities. To preserve the contribution of global features while alleviating computational complexity, a 1 × 1 convolution is applied to each pooled feature map to reduce the channel dimension to one-fourth (C/4) of the input. After applying the 1 × 1 convolution, we use Batch Normalization to normalize the feature distribution. This helps stabilize the training process and ensures consistent feature representation across iterations. These intermediate features are then upsampled through bilinear interpolation to match the spatial resolution of the input feature map. Finally, the upsampled feature maps from all four pyramid levels are concatenated with the original input feature map, resulting in a comprehensive global feature representation $F_P \in \mathbb{R}^{H \times W \times 2C}$ for this stage.
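A minimal PyTorch sketch of such a pyramid pooling stage is given below. It assumes the standard PSPNet reading of the {1, 2, 3, 6} setting as adaptive pooling output sizes; the ReLU after Batch Normalization and the layer organization are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    """PSPNet-style PPM: pool the input at bin sizes {1, 2, 3, 6}, reduce each pooled
    map to C/4 channels with a 1x1 conv + BatchNorm + ReLU, upsample back with bilinear
    interpolation, and concatenate with the input (C + 4 * C/4 = 2C output channels)."""
    def __init__(self, in_channels: int, bins=(1, 2, 3, 6)):
        super().__init__()
        reduced = in_channels // 4
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(output_size=b),
                nn.Conv2d(in_channels, reduced, kernel_size=1, bias=False),
                nn.BatchNorm2d(reduced),
                nn.ReLU(inplace=True),
            )
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        pyramid = [x]
        for branch in self.branches:
            y = branch(x)
            pyramid.append(F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False))
        return torch.cat(pyramid, dim=1)  # (B, 2C, H, W)

if __name__ == "__main__":
    ppm = PyramidPoolingModule(in_channels=512)
    print(ppm(torch.randn(1, 512, 32, 32)).shape)  # torch.Size([1, 1024, 32, 32])
```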
The SMFA module is a lightweight component designed to collaboratively fuse local and nonlocal features. To begin with, the input feature map is normalized and then its channel dimension is expanded by a 1 × 1 convolution. Subsequently, the resulting feature map is evenly split along the channel axis into two parts, where each branch receives C/2 channels given an input of C channels. These parts are separately fed into the EASA and LDE branches.
$M, N = S(\mathrm{Conv}_{1 \times 1}(\lVert F_P \rVert_2))$
where $\lVert \cdot \rVert_2$ denotes L2 normalization, $\mathrm{Conv}_{1 \times 1}$ represents a 1 × 1 convolutional layer, and $S(\cdot)$ refers to the channelwise splitting operation. Subsequently, the two subfeature maps $M$ and $N$ are processed in parallel by the EASA and LDE branches, which generate the nonlocal feature $M_l$ and the local detail feature $N_d$, respectively. These two outputs are then aggregated and passed through a 1 × 1 convolutional layer to form the final representative output. This process can be formulated as follows:
$F_{out} = \mathrm{Conv}_{1 \times 1}(M_l + N_d)$
where $F_{out}$ denotes the final output features.
In the EASA branch, a downsampling operation is first applied to extract the low-frequency components of the input feature map. These downsampled features are then passed through a 3 × 3 depthwise convolution to generate the nonlocal structural representation, denoted as $M_S \in \mathbb{R}^{H/8 \times W/8 \times C}$:
$M_S = \mathrm{DWConv}_{3 \times 3}(D(M))$
where $D(\cdot)$ denotes an adaptive max pooling operation with a scaling factor of 8, which is used to obtain low-frequency representations, and $\mathrm{DWConv}_{3 \times 3}(\cdot)$ represents a 3 × 3 depthwise convolutional layer used to generate nonlocal structural features. To incorporate a global description into the nonlocal representation $M_S$, the variance of $M$ is introduced as a measure of spatial statistical dispersion as follows:
$\sigma^2(M) = \frac{1}{Y} \sum_{i=0}^{Y-1} (m_i - \mu)^2$
where $\sigma^2(M) \in \mathbb{R}^{1 \times 1 \times C}$ denotes the channelwise variance computed from $M$; $Y$, the total number of pixels; $m_i$, the value of a single pixel $i$; and $\mu$, the mean value of all pixel intensities. This variance is then fused with the nonlocal representation $M_S$ through a 1 × 1 convolutional layer as follows:
$M_x = \mathrm{Conv}_{1 \times 1}(M_S + \sigma^2(M))$
where $M_x \in \mathbb{R}^{H/8 \times W/8 \times C}$ denotes the modulated feature representation. This variance-based modulation mechanism facilitates a more effective capture of nonlocal information by enhancing the global context modeling capabilities.
Finally, the modulated features are used to aggregate the input feature $M$, resulting in the extraction of the representative structural information $M_l$ as follows:
$M_l = M \otimes U(\phi(M_x))$
where $\phi(\cdot)$ denotes the GELU activation function; $U(\cdot)$, the nearest-neighbor upsampling operation; and $\otimes$, the Hadamard product.
In the LDE branch, a 3 × 3 dilated depthwise convolution with a dilation rate of 2 is first applied to encode the local information $N_h$ from the input feature map $N$ as follows:
$N_h = \mathrm{Conv}_{1 \times 1}(\mathrm{DWConv}_{3 \times 3}(N))$
This is then followed by two sequential 1 × 1 convolutional layers, interleaved with a GELU activation function, which together generate the enhanced local feature representation $N_d$ as follows:
$N_d = \mathrm{Conv}_{1 \times 1}(\phi(N_h))$
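The following sketch assembles the two branches into a single PyTorch module according to the formulas above. The normalization choice (channel-wise L2), the application to the 2C-channel PPM output, the intermediate channel widths, and the final 1 × 1 convolution restoring the full channel count are assumptions; the sketch illustrates the coarse-to-fine design rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SMFA(nn.Module):
    """Self-modulation feature aggregation sketch: the input is normalized, projected by
    a 1x1 conv, split into an EASA branch (nonlocal context) and an LDE branch (local
    details), and the two outputs are summed and fused by a final 1x1 conv."""
    def __init__(self, channels: int, down_factor: int = 8):
        super().__init__()
        assert channels % 2 == 0
        half = channels // 2
        self.down_factor = down_factor
        self.proj_in = nn.Conv2d(channels, channels, kernel_size=1)
        # EASA branch: depthwise conv on downsampled features + variance-based modulation.
        self.easa_dw = nn.Conv2d(half, half, kernel_size=3, padding=1, groups=half)
        self.easa_pw = nn.Conv2d(half, half, kernel_size=1)
        # LDE branch: dilated depthwise conv followed by pointwise convs with GELU.
        self.lde_dw = nn.Conv2d(half, half, kernel_size=3, padding=2, dilation=2, groups=half)
        self.lde_pw1 = nn.Conv2d(half, half, kernel_size=1)
        self.lde_pw2 = nn.Conv2d(half, half, kernel_size=1)
        self.proj_out = nn.Conv2d(half, channels, kernel_size=1)  # assumed channel restoration

    def forward(self, x):
        h, w = x.shape[2:]
        x = F.normalize(x, p=2, dim=1)                  # L2 normalization (assumed channel-wise)
        m, n = torch.chunk(self.proj_in(x), 2, dim=1)   # split into M and N

        # EASA branch
        size = (max(1, h // self.down_factor), max(1, w // self.down_factor))
        m_s = self.easa_dw(F.adaptive_max_pool2d(m, size))
        var = m.var(dim=(2, 3), unbiased=False, keepdim=True)   # per-channel spatial variance of M
        m_x = self.easa_pw(m_s + var)                            # variance-based modulation
        m_l = m * F.interpolate(F.gelu(m_x), size=(h, w), mode="nearest")

        # LDE branch
        n_h = self.lde_pw1(self.lde_dw(n))
        n_d = self.lde_pw2(F.gelu(n_h))

        return self.proj_out(m_l + n_d)

if __name__ == "__main__":
    smfa = SMFA(channels=1024)  # e.g. applied to the 2C-channel PPM output
    print(smfa(torch.randn(1, 1024, 32, 32)).shape)
```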
To validate the enhancement effect of the MSFA module on high-level semantic representations, we conducted a comparative test using two randomly selected images. As shown in Figure 3, the visualized results of each test image included three feature maps: the original high-level feature map, the feature map refined by PPM, and the final feature map refined by SMFA. According to the visualization, the raw high-level feature maps exhibit blurred activations across building regions, with diffuse patterns that hinder precise target identification. After applying the PPM, the semantic activations in building areas become more prominent, resulting in more distinct architectural patterns in the heatmaps. The introduction of the SMFA module further refines the feature representation of buildings while reducing false activations in background areas. Overall, the proposed MSFA contributes to the structural integrity and spatial focus of the building areas.

2.2. Attention-Enhanced Feature Refinement

In semantic segmentation tasks, pixel-level prediction outputs must be upsampled to match the original image resolution. However, this upsampling process often leads to a loss of spatial detail, rendering the full reconstruction of fine-grained image information difficult. To enhance the transmission of meaningful features and suppress interference from irrelevant categories, a CBAM [26] is incorporated at the end of each of the four decoding stages in MSA-UNet. Specifically, four CBAMs are applied across all decoder levels to progressively refine features from high to low semantic levels. The detailed structure of CBAM is illustrated in Figure 4.
CBAM is composed of two sequential submodules: a channel attention module (CAM) and a spatial attention module (SAM). The CAM recalibrates the importance of each feature channel by enhancing semantic features that demonstrate high correlation with target regions, while suppressing irrelevant or less informative responses. The SAM focuses on critical spatial locations within the image, strengthening the recovery of edge contours and fine details. By working in tandem, the two modules effectively mitigate the loss of spatial accuracy and detail blurring caused by the upsampling process, thereby enhancing the overall quality of feature reconstruction. Let $F_{in} \in \mathbb{R}^{H \times W \times C}$ denote the input feature map and $F_{out} \in \mathbb{R}^{H \times W \times C}$ the output feature map of the CBAM. The CBAM operation can be formulated as follows:
$F_{out} = (F_{in} \otimes M_C) \otimes M_S$
where $M_C \in \mathbb{R}^{1 \times 1 \times C}$ denotes the channel attention map and $M_S \in \mathbb{R}^{H \times W \times 1}$ the spatial attention map. Their respective formulations are as follows:
$M_C = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F_{in})) + \mathrm{MLP}(\mathrm{MaxPool}(F_{in})))$
$M_S = \sigma(f^{7 \times 7}([\mathrm{AvgPool}(F_{in}); \mathrm{MaxPool}(F_{in})]))$
where $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ denote average and max pooling (performed over the spatial dimensions for $M_C$ and along the channel axis for $M_S$), $f^{7 \times 7}$ denotes a convolutional layer with a 7 × 7 kernel, $\otimes$ denotes element-wise multiplication, and $\sigma$ denotes the Sigmoid activation function.
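A compact sketch of this dual-attention block is shown below, following the two equations as written (with the spatial attention computed from the input feature map, whereas the original CBAM applies it to the channel-refined map). The reduction ratio of 16 in the shared MLP is an assumed default taken from the original CBAM paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel + spatial attention: M_C reweights channels using a shared MLP over global
    average- and max-pooled descriptors; M_S reweights spatial positions using a 7x7 conv
    over channel-wise average- and max-pooled maps."""
    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=spatial_kernel,
                                      padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention M_C (B, C, 1, 1)
        avg_c = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        max_c = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        m_c = torch.sigmoid(avg_c + max_c)
        # Spatial attention M_S (B, 1, H, W)
        avg_s = torch.mean(x, dim=1, keepdim=True)
        max_s, _ = torch.max(x, dim=1, keepdim=True)
        m_s = torch.sigmoid(self.spatial_conv(torch.cat([avg_s, max_s], dim=1)))
        return (x * m_c) * m_s

if __name__ == "__main__":
    cbam = CBAM(channels=256)
    print(cbam(torch.randn(1, 256, 128, 128)).shape)
```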

2.3. ASC Module

In traditional skip connections, the low-level texture information obtained from the encoder is directly fused with the high-level semantic information obtained from the decoder. Although this approach effectively preserves fine-grained details, such as building edges and shapes, it concatenates channels without considering the relative importance of each feature. This often leads to semantic inconsistencies between low-level encoder and high-level decoder features, resulting in noise from elements such as trees and shadows. To mitigate this limitation, the present study introduces an ASC specifically designed for 2D U-Net architectures. The proposed ASC separates feature reweighting from feature extraction into two distinct steps, thereby avoiding the challenge in traditional skip connections where a single convolutional layer must simultaneously handle both tasks. It is important to differentiate the proposed ASC gating from standard attention gates in Attention U-Net [18] or the gated fusion mechanisms in DSNet [22]. In Attention U-Net, the gating primarily acts as a unilateral filter, applying a scalar weight map exclusively to the encoder features to suppress irrelevant regions prior to concatenation, leaving the decoder features unaltered in this process. In contrast, our ASC gating adopts a complementary residual approach. Rather than merely masking the encoder features, it computes a trade-off map to dynamically balance the contributions from the decoder’s semantic context and the encoder’s spatial details. By applying complementary inverse weighting to these two streams, this approach enables adaptive fusion that preserves global semantics while incorporating key boundary details, potentially minimizing information loss over one-sided filtering. The detailed structure of the ASC is illustrated in Figure 5, where convolutional operations are first used to extract features, followed by an attention mechanism that adaptively adjusts the feature weights.
Formally, let $F_e^i, F_d^i \in \mathbb{R}^{H \times W \times C}$ denote the encoder and decoder feature maps at the $i$-th layer. The proposed ASC module first concatenates these features and projects them into an intermediate representation $F_m \in \mathbb{R}^{H \times W \times C}$ to capture joint spatial-semantic relationships. We define the gating mask $Q^i \in \mathbb{R}^{H \times W \times C}$ as:
$Q^i = \sigma(\mathrm{Conv}_{3 \times 3}(\delta(\mathrm{Conv}_{1 \times 1}([F_e^i, F_d^i]))))$
where $[\cdot, \cdot]$ denotes channel-wise concatenation, $\delta$ is the LeakyReLU activation, and $\sigma$ is the Sigmoid function. The final fused feature map $F_p^i$ is obtained by:
$F_p^i = (E - Q^i) \otimes F_d^i + Q^i \otimes F_e^i$
where $E \in \mathbb{R}^{H \times W \times C}$ denotes a tensor of ones and $\otimes$ is the Hadamard product.
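The gating and complementary fusion can be sketched as follows. The LeakyReLU slope, the intermediate channel width, and the assumption that encoder and decoder features share the same channel count are illustrative choices rather than values taken from the paper.

```python
import torch
import torch.nn as nn

class AttentiveSkipConnection(nn.Module):
    """Attentive skip connection: a gating mask Q is predicted from the concatenated
    encoder/decoder features and used for complementary fusion,
    F_p = (1 - Q) * F_d + Q * F_e, following the formulas above."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.LeakyReLU(0.1, inplace=True),   # slope assumed
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, f_enc, f_dec):
        q = self.gate(torch.cat([f_enc, f_dec], dim=1))
        return (1.0 - q) * f_dec + q * f_enc

if __name__ == "__main__":
    asc = AttentiveSkipConnection(channels=256)
    fused = asc(torch.randn(1, 256, 128, 128), torch.randn(1, 256, 128, 128))
    print(fused.shape)
```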

2.4. Joint Loss Function

The loss function of MSA-UNet integrates the binary cross-entropy (BCE) loss and the Dice loss. The BCE loss is a commonly used loss function in semantic segmentation tasks, primarily measuring the pixel-wise differences between the predicted results and the ground-truth labels. The BCE loss effectively optimizes the classification performance by reducing the discrepancy between the predicted class probability distribution and the true labels, making it particularly suitable for standard pixel-level binary classification problems. BCE loss is formulated as follows:
$L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$
where $N$ denotes the total number of pixels; $y_i$ takes a value of either 0 or 1, indicating the ground-truth label for the $i$-th pixel, where 0 represents the background and 1 the building class; and $p_i$ is a continuous value within the range [0, 1], representing the predicted probability that the $i$-th pixel belongs to the building class.
However, when confronted with issues regarding class imbalance, the BCE loss tends to neglect the fine-grained discrimination of small object regions. To address this limitation, the present study incorporates the Dice loss to complement the shortcomings of the BCE loss. The Dice loss is mainly based on the Dice coefficient, which measures the overlap between the predicted and ground-truth regions. It particularly focuses on the structural and morphological details of target objects, making it highly effective in improving the recognition of small buildings or other minority classes under class-imbalanced scenarios. The Dice loss is formulated as follows:
$L_{Dice} = 1 - \frac{2 \sum_{i=1}^{N} y_i p_i + \epsilon}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} p_i + \epsilon}$
where $N$ denotes the total number of pixels; $y_i$ takes a value of either 0 or 1, indicating the ground-truth label for the $i$-th pixel, where 0 represents the background and 1 the building class; $p_i$ is a continuous value within the range [0, 1], representing the predicted probability that the $i$-th pixel belongs to the building class; and $\epsilon$ is a constant smoothing factor added to prevent division by zero. In this study, we set $\epsilon$ to 1.
The total loss function L t o t a l is constructed based on the BCE loss, by adding the Dice loss in a weighted manner to the base loss. This results in a joint loss function that can be formulated as:
$L_{total} = a L_{BCE} + b L_{Dice}$
where a and b are weighting parameters used to balance the contribution of each loss term, ensuring that they remain on the same order of magnitude. In our implementation, we set the values of both a and b to 0.5.
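A direct PyTorch implementation of this joint loss might look as follows, assuming the network outputs a single-channel logit map for the building class (the sigmoid is applied inside the loss).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BCEDiceLoss(nn.Module):
    """Joint loss L_total = a * L_BCE + b * L_Dice with a = b = 0.5 and a Dice
    smoothing constant eps = 1, as described above."""
    def __init__(self, a: float = 0.5, b: float = 0.5, eps: float = 1.0):
        super().__init__()
        self.a, self.b, self.eps = a, b, eps

    def forward(self, logits, target):
        # logits: raw network outputs (B, 1, H, W); target: binary masks of the same shape.
        bce = F.binary_cross_entropy_with_logits(logits, target)
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(1, 2, 3))
        denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
        dice = 1.0 - (2.0 * inter + self.eps) / (denom + self.eps)
        return self.a * bce + self.b * dice.mean()

if __name__ == "__main__":
    criterion = BCEDiceLoss()
    loss = criterion(torch.randn(2, 1, 64, 64), torch.randint(0, 2, (2, 1, 64, 64)).float())
    print(loss.item())
```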

3. Experimental Setup and Evaluation

3.1. Experimental Data

To evaluate the performance of MSA-UNet, comparative experiments were conducted using the WHU [27] and Inria [28] datasets. The WHU aerial imagery dataset was developed by Wuhan University. It consists of HR remote sensing data covering approximately 450 km2 in Christchurch, New Zealand. Furthermore, it includes vector annotations for over 220,000 individual buildings and provides a downsampled dataset at a spatial resolution of 0.3 m, containing 187,000 building samples. The dataset was standardized into 8189 image patches measuring 512 × 512 pixels using a non-overlapping sliding window strategy. Among these, 4736 images were used for training, 1036 for validation, and 2416 for testing.
The Inria aerial dataset covers a total area of 810 km2 across five cities: Austin, Chicago, Vienna, Kitsap, and Tyrol. Each city is represented by 36 RGB orthoimages with a spatial resolution of 0.3 m, each image measuring 5000 × 5000 pixels, accompanied by corresponding binary ground-truth masks. For experimental purposes, the dataset was divided into smaller patches of 512 × 512 pixels using a non-overlapping sliding window strategy. The dataset was partitioned based on the IDs of the original orthoimages rather than by random mixing. Specifically, for each city, the first 10 tiles were assigned to the test set, the subsequent 4 to the validation set, and the remaining 22 to the training set. This resulted in 8910 images for the training set, 1620 images for the validation set, and 4050 images for the test set.
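A simple tiling routine consistent with this protocol is sketched below. Discarding the incomplete border strip is an assumption, although it matches the reported patch counts (9 × 9 = 81 patches per 5000 × 5000 tile, giving 8910/1620/4050 patches for the 22/4/10 tiles per city).

```python
import numpy as np

def tile_image(image: np.ndarray, mask: np.ndarray, patch: int = 512):
    """Split an orthoimage (H, W, 3) and its binary mask (H, W) into non-overlapping
    patch x patch tiles, discarding any incomplete border strip."""
    h, w = mask.shape
    tiles = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            tiles.append((image[top:top + patch, left:left + patch],
                          mask[top:top + patch, left:left + patch]))
    return tiles

if __name__ == "__main__":
    img = np.zeros((5000, 5000, 3), dtype=np.uint8)
    msk = np.zeros((5000, 5000), dtype=np.uint8)
    print(len(tile_image(img, msk)))  # 81 patches per Inria tile
```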

3.2. Experimental Details and Environment Settings

All experiments in the present study were conducted on an Ubuntu 20.04 operating system, with an Intel® Xeon® W-2295 CPU @ 3.00 GHz (36 cores), 64 GB system RAM, 1 TB SSD storage, and an NVIDIA GeForce RTX 3090 GPU (24 GB video memory) to accelerate model training. The training process involved the use of pretrained weights from ImageNet. The deep learning environment was configured using Python 3.8, PyTorch 1.12.1, and CUDA 11.4. All random seeds were fixed to 11. The Adam optimizer was used with an initial learning rate of 0.0001 and a momentum parameter of 0.9. The model was trained for 50 epochs with a batch size of two. To accelerate training and mitigate overfitting, particularly when the training data were limited, the experiments adopted a two-stage training strategy comprising frozen and unfrozen stages. In the frozen stage, the backbone network was kept frozen and only the segmentation head was trained. In the unfrozen stage, the entire network was fine-tuned by allowing all layers to be updated. Both stages used the same optimizer configuration, and no separate learning rates were assigned to the frozen and unfrozen phases.
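The two-stage schedule can be sketched as follows. The number of frozen epochs, the hypothetical `model.backbone` attribute holding the VGG16 encoder, and the interpretation of the momentum parameter as Adam's first-moment coefficient are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader

def train_two_stage(model, train_set, criterion, device="cuda",
                    freeze_epochs=25, total_epochs=50, lr=1e-4, batch_size=2):
    """Frozen/unfrozen schedule as described above: the backbone is frozen first and only
    the remaining layers are trained, then the whole network is fine-tuned with the same
    Adam settings."""
    model.to(device)
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=4)
    # "Momentum parameter of 0.9" is interpreted here as Adam's beta1 coefficient.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))

    for epoch in range(total_epochs):
        frozen = epoch < freeze_epochs
        for p in model.backbone.parameters():  # hypothetical attribute holding the VGG16 encoder
            p.requires_grad = not frozen

        model.train()
        running = 0.0
        for images, masks in loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
            running += loss.item()
        stage = "frozen" if frozen else "unfrozen"
        print(f"epoch {epoch + 1:02d}/{total_epochs} [{stage}] mean loss {running / max(len(loader), 1):.4f}")
```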
Figure 6 illustrates the decline in loss values during the training of MSA-UNet on the WHU and Inria datasets. It can be seen from the figure that the loss curves for the training and validation sets consistently decrease and eventually converge. Specifically, on the WHU dataset, the training and validation losses stabilize at approximately 0.058 and 0.101, respectively. On the Inria dataset, the final losses reach 0.184 for training and 0.259 for validation. This synchronous convergence of the two curves not only confirms the strong feature learning capability and data fitting potential of MSA-UNet but also clearly demonstrates the effectiveness of the hyperparameter settings and loss function design in the experiments.

3.3. Evaluation Metrics

To quantitatively evaluate the segmentation performance of MSA-UNet, four widely adopted metrics were used: mean Intersection over Union (mIoU), accuracy, F1-score, and mean Pixel accuracy (mPA), which assess the building extraction capabilities of a given model. Specifically, IoU measures the ratio of the intersection area to the union area between the predicted and ground-truth regions for a particular class, and mIoU represents the mean IoU across all classes. Accuracy is defined as the ratio of correctly predicted samples to the total number of samples. F1-score is the harmonic mean of precision and recall, which provides a balanced measure of classification performance. mPA calculates the proportion of correctly classified pixels for each class and then averages these proportions across all classes. The mathematical formulations of these metrics are as follows:
$mIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{TP}{TP + FP + FN}$
$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
$mPA = \frac{1}{k} \left( \frac{TP}{TP + FP} + \frac{TN}{TN + FN} \right)$
$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} = \frac{2}{2 + \frac{FP}{TP} + \frac{FN}{TP}}$
where $k$ denotes the number of classes; TP (true positive), the number of pixels correctly classified as buildings; FP (false positive), the number of background pixels incorrectly classified as buildings; FN (false negative), the number of building pixels incorrectly classified as background; TN (true negative), the number of background pixels correctly classified as background; $Precision = \frac{TP}{TP + FP}$, the proportion of correctly classified building pixels out of all pixels predicted as buildings; and $Recall = \frac{TP}{TP + FN}$, the proportion of correctly classified building pixels out of all actual building pixels.
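The following helper computes these metrics from binary prediction and ground-truth masks, with building and background treated as the two classes; the mPA term uses the per-class ratios exactly as written in the formula above.

```python
import numpy as np

def building_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute mIoU, Accuracy, F1-score, and mPA from binary masks (1 = building, 0 = background)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = float(np.logical_and(pred, gt).sum())
    fp = float(np.logical_and(pred, ~gt).sum())
    fn = float(np.logical_and(~pred, gt).sum())
    tn = float(np.logical_and(~pred, ~gt).sum())

    iou_building = tp / (tp + fp + fn)
    iou_background = tn / (tn + fp + fn)
    miou = (iou_building + iou_background) / 2.0

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2.0 * precision * recall / (precision + recall)
    mpa = 0.5 * (tp / (tp + fp) + tn / (tn + fn))  # per-class ratios as given above
    return {"mIoU": miou, "Accuracy": accuracy, "F1": f1, "mPA": mpa}

if __name__ == "__main__":
    rng = np.random.default_rng(11)
    gt = rng.integers(0, 2, size=(512, 512))
    pred = gt.copy()
    pred[:32] = 1 - pred[:32]  # flip a strip to simulate prediction errors
    print(building_metrics(pred, gt))
```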

4. Experimental Results and Analysis

To validate the effectiveness of MSA-UNet in building extraction, comparative experiments were conducted against several state-of-the-art methods, including U-Net [8], PSPNet [9], DeepLabV3+ [12], Attention U-Net [18], STTNet [15], and DSATNet [16]. All models were trained and tested under identical experimental conditions. Furthermore, a series of ablation studies was designed to rigorously evaluate the individual contributions and overall effectiveness of the integrated modules within MSA-UNet.

4.1. Ablation Study

To further confirm the effectiveness of the ASC, MSFA, and CBAM in MSA-UNet, a series of ablation experiments was conducted on the WHU and Inria datasets. In these experiments, we systematically investigated the performance improvements contributed by each of the three modules. Moreover, we evaluated the model complexity of each ablation setting in terms of the number of parameters and floating-point operations (FLOPs). All ablation configurations were trained under a unified experimental protocol, ensuring that the reported results are directly comparable across variants.
Table 2 presents the quantitative evaluation results of the ablation study, where ASC, MSFA, and CBAM represent the respective modules. The results indicated that the incorporation of all three modules (MSFA, CBAM, and ASC) into the baseline U-Net architecture led to the best overall performance.
On the WHU dataset, the introduction of only the MSFA module improved the mIoU from 92.85% to 93.56%, suggesting that MSFA effectively enhanced the ability of the model to capture building boundaries. Upon further integrating the CBAM, the mIoU increased to 94.00%, indicating that the attention mechanism strengthened the weighting of critical features. Finally, when the ASC module was also incorporated, leveraging all three modules together, the mIoU further improved to 94.26%, and accuracy, F1-score, and mPA reached their highest values. This indicated that the ASCs alleviated the issue of deep feature loss via cross-level information interaction.
For the Inria dataset, the baseline model achieved an mIoU of 84.05%. When only the MSFA module was used, the mIoU increased to 85.39%, whereas the combination of CBAM and ASC yielded an mIoU of 85.21%, highlighting the complementary nature of the MSFA and attention mechanisms in complex urban scenes. When all three modules were jointly incorporated, the model achieved its peak performance, further confirming that the synergistic effect of these modules could considerably enhance the generalization capability. Notably, the performance gain of the ASC module on the Inria dataset was slightly lower than that on the WHU dataset. This could be attributed to the larger variation in building scales present in the Inria dataset, making the integration of multiscale features through cross-level skip connections critical for achieving optimal performance.
It is worth noting that the MSFA module consists of both PPM and SMFA components. To assess the specific contribution of the SMFA design, we conducted an additional isolation experiment comparing a PPM-only configuration with the complete MSFA module. On the WHU building dataset, incorporating PPM alone raised the baseline mIoU from 92.85% to 93.16%. Adding the SMFA module further improved performance to 93.56%. This comparison confirms that while PPM effectively provides a coarse multiscale global context, the proposed SMFA module further refines these features, delivering a noticeable yet modest performance improvement beyond standard pyramid pooling alone.
Furthermore, to evaluate the statistical significance of the proposed method and account for possible stochastic variance, we conducted repeated experiments using three different random seeds on the WHU building dataset. For the primary metric mIoU, the baseline U-Net achieved a mean of 92.81 ± 0.13%, while our proposed MSA-UNet reached 94.23 ± 0.11%. The observed improvement is considerably larger than the standard deviation, suggesting that the performance gains are statistically meaningful and not likely due to random initialization effects.
As shown in Figure 7, qualitative comparisons of different ablation settings on both the WHU and Inria datasets illustrate the progressive improvements introduced by each module. From left to right, the results reveal that the baseline U-Net struggles with fragmented building footprints and imprecise boundaries, particularly for small or irregular structures. The introduction of the MSFA module aids in recovering more complete building footprints, while the addition of CBAM enhances boundary accuracy by mitigating noise in cluttered regions. Finally, with the incorporation of the ASC module, the segmentation results exhibit sharp boundaries and improved delineation of small buildings, highlighting the synergistic effect of all three modules.
In summary, the ablation study demonstrated that the MSFA module enhanced the sensitivity of the model to fine building details through multiscale feature fusion, the CBAM optimized feature representation via channel and spatial attention, and the ASC module mitigated information loss through cross-level connections. When these three modules were jointly used, they markedly improved the accuracy and robustness of building footprint extraction, particularly in complex urban environments.
In addition to segmentation accuracy, we also quantify the computational overhead introduced by the proposed modules. As reported in Table 3, integrating the MSFA module increases the parameter count of the baseline U-Net from 24.90 M to 27.79 M, and the FLOPs from 92.04 G to 97.43 G. Incorporating CBAM on top of MSFA further raises the parameters only slightly to 27.92 M and the FLOPs to 98.50 G. The complete MSA-UNet, equipped with MSFA, CBAM, and ASC, contains 28.44 M parameters and requires 102.05 G FLOPs, corresponding to an overall increase of about 14.2% in parameters and 10.9% in FLOPs relative to the baseline U-Net. On the other hand, these moderate increases in model complexity are accompanied by consistent improvements in segmentation performance. As shown in Table 2, the variant equipped with all three modules achieves the highest mIoU, accuracy, F1-score, and mPA on both the WHU and Inria datasets, outperforming both the baseline U-Net and the partially enhanced variants. Additionally, the qualitative results presented in Figure 7 further demonstrate that the full MSA-UNet model provides more accurate and continuous boundaries. While the added complexity results in a modest increase in computational overhead, the observed improvements in segmentation accuracy and boundary delineation across both datasets suggest that this increase is justifiable. Therefore, we consider this trade-off between computational cost and segmentation performance to be a reasonable compromise for building extraction tasks.

4.2. Comparative Experiments and Analysis

4.2.1. WHU Dataset Results and Analysis

The quantitative evaluation results from different models on the WHU dataset are summarized in Table 4. As can be seen from the table, MSA-UNet consistently outperformed all the comparative models across all evaluation metrics. Compared with the baseline U-Net model, MSA-UNet improved the mIoU by 1.41%, with further increases of 0.62% in accuracy, 0.87% in F1-score, and 0.59% in mPA. MSA-UNet also achieved better performance than other state-of-the-art models, highlighting its superior ability to ensure completeness and accuracy in building extraction tasks. In addition, the parameter count and per-epoch runtime of MSA-UNet remain at a moderate level. As reported in Table 4, incorporating the MSFA, CBAM, and ASC modules increases the model size from 24.90 M parameters in the baseline U-Net to 28.44 M parameters, and raises the per-epoch training time on the WHU dataset from 3.42 min to 6.83 min. Nevertheless, its runtime remains comparable to other attention-based or Transformer-enhanced models, such as Attention U-Net and DSATNet. These results demonstrate that, although MSA-UNet introduces some additional computational cost, it maintains a reasonable trade-off between segmentation performance and computational efficiency.
Figure 8 presents the visual results of building extraction using seven different methods on the WHU dataset. Compared with existing methods, MSA-UNet extracted buildings with higher integrity and continuity, effectively suppressed overextraction, and achieved markedly improved boundary localization accuracy. For example, in the large-scale standalone building scenes of Experiments I–IV, when rooftops exhibit spectral similarities to surrounding surfaces, such as roads and bare soil, comparative models often suffer from feature confusion, resulting in blurred building boundaries and frequent misclassification of buildings as roads or omission as background. MSA-UNet addresses this issue by leveraging the MSFA module, which extracts features across diverse receptive fields in parallel. This alleviates the loss of critical information during building extraction and enhances the continuity of spatial representations, such as textures and colors, effectively suppressing hollow artifacts and feature discontinuities within large buildings. In more complex small-scale dense building scenes of Experiments V–VI, factors such as tree occlusion, varying illumination, and shadow effects often lead to overextraction in comparative models. As highlighted in the enlarged views of Experiments V and VI, MSA-UNet produces clearer building boundaries and more complete small structures than the comparative methods. The integration of the dual-attention mechanism in the decoder phase of MSA-UNet focuses on distinguishing features at building edges and between neighboring objects, enhancing the representation of salient features. Meanwhile, the ASC module effectively reduces noise from shadows and other background disturbances. Overall, the synergistic incorporation of the MSFA module, ASC module, and dual-attention mechanism in MSA-UNet markedly improves edge segmentation precision, thereby alleviating issues of false detection and overextraction.

4.2.2. Inria Dataset Results and Analysis

Table 5 presents the quantitative evaluation results of different models on the Inria dataset. Compared with the WHU dataset, the Inria dataset covers aerial imagery from multiple cities, introducing more complex urban scenarios, higher intraclass variation, and illumination differences across different seasons. Consequently, the overall accuracy of all models is generally lower on the Inria dataset than on the WHU dataset. Nevertheless, as presented in Table 5, MSA-UNet consistently demonstrated the best performance across all evaluation metrics. Compared with the baseline U-Net model, MSA-UNet improved the mIoU by 1.87%, with further increases of 0.96% in accuracy, 1.57% in F1-score, and 1.61% in mPA. Although these numerical gains are modest, they are consistent with the performance improvements observed on the WHU dataset. However, similar to the observations on the WHU dataset, the added model complexity inevitably incurs a longer runtime. Specifically, MSA-UNet requires 12.93 min per epoch on the Inria dataset, whereas the baseline U-Net takes only 6.35 min per epoch. Although the increased model capacity and runtime reflect the cost of the additional multiscale and attention modules, the training time of MSA-UNet remains on the same order as other attention-based or Transformer-enhanced baselines, such as Attention U-Net and DSATNet.
The visualized building extraction results for different networks on the Inria dataset are illustrated in Figure 9. From a visual evaluation perspective, the results generated by MSA-UNet were consistently closer to the ground-truth than those of the comparative models. In the images of Experiments I, III, and IV, several small buildings were omitted by the comparative models, indicating that the attention modules incorporated in MSA-UNet effectively focused on small targets and suppressed background interference, thereby improving segmentation accuracy. In the images of Experiments II and IV, owing to the spectral heterogeneity caused by diverse rooftop materials and interference from adjacent objects, comparative models exhibited noticeable topological discontinuities, which resulted in many hollow artifacts. The MSFA module in MSA-UNet addressed this issue by performing parallel multireceptive field feature extraction, enabling collaborative modeling of local details and global contextual information. This mitigates information degradation and enhances the integrity and continuity of the extracted buildings. In addition, the ASC module alleviates information loss through effective cross-level feature integration. In Experiments III, IV, and VI, tree occlusion and shadows around buildings lead to false detections of nonbuilding areas, blurred building contours, and boundary adhesion in the comparative models. The proposed model’s synergistic optimization of the MSFA module and ASCs enhances the expression of relevant features, thereby improving edge segmentation accuracy and markedly reducing false detections and redundant extractions. As highlighted in the zoomed-in views of Experiments V and VI, MSA-UNet better preserves the morphology of densely distributed small buildings and more effectively suppresses spurious activations along road-like structures than the comparative methods.

4.2.3. Accuracy Analysis of Building Boundary Extraction

To facilitate a more intuitive comparison of building boundary extraction accuracy across different networks, we further extracted contour lines from the predicted segmentation results. As illustrated in Figure 10 and Figure 11, the building outlines generated by MSA-UNet appear noticeably smoother and more well-defined than those of the competing methods, demonstrating a higher degree of alignment with ground-truth boundaries.
To quantitatively compare the classification accuracy of building boundary pixels extracted by different networks, five evaluation metrics are adopted: mIoU, Accuracy, F1-score, mPA, and the Hausdorff distance (HD). Here, TP refers to the number of correctly extracted boundary pixels, FP to the number of falsely extracted boundary pixels, and FN to the number of missed boundary pixels. The HD measures the maximum distance between the predicted and ground-truth boundary contours, where a lower value indicates more accurate boundary localization. As presented in Table 6, MSA-UNet exhibits a notable advantage in boundary delineation precision. On the WHU dataset, which contains dense urban layouts with well-defined building outlines, MSA-UNet achieves the highest mIoU, Accuracy, and mPA in our experiments. Its F1-score is only slightly lower than that of DSATNet, and its HD is comparable to the minimum HD reported for DSATNet. On the more challenging Inria dataset, characterized by greater background complexity and architectural diversity, MSA-UNet also performs favorably. It attains the highest mIoU, F1-score, and mPA, as well as the lowest HD, among the compared methods, although its Accuracy remains slightly below that of DSATNet. Overall, the results on both datasets suggest that, within the scope of our experiments, MSA-UNet can provide robust adaptability and relatively precise boundary localization, performing well in both structured urban environments and more heterogeneous scenes.

5. Conclusions

To address the limitations of U-Net in retaining fine-grained details and maintaining segmentation accuracy across multiscale scenes in HR remote sensing imagery, the present study proposes MSA-UNet, which incorporates an MSFA module. The combination of pyramid pooling and SMFA in a serial manner enables the model to effectively process buildings at different scales in complex urban scenes and minimizes the loss of critical information during multiscale feature extraction. In addition, a channel-spatial attention module (CBAM) is introduced in the decoder to adaptively recalibrate the feature maps, thereby enhancing the response of target areas while suppressing background noise and further optimizing segmentation performance. Furthermore, a novel skip connection mechanism is employed to compute attention weights for features from different sources, considerably improving the denoising capability. Comparative experiments conducted on the publicly available WHU and Inria building datasets demonstrated that MSA-UNet shows better segmentation performance than the baseline U-Net model. When compared with several state-of-the-art semantic segmentation models, MSA-UNet attains comparable or slightly superior performance across most evaluation metrics. Despite these promising results, MSA-UNet still has a limitation. The MSFA and attention mechanisms inevitably introduce additional parameters and runtime, which may restrict the practicality of the model in scenarios with strict computational or latency constraints.
Future work will therefore focus on improving both the efficiency and the modeling capacity of the proposed framework. On the one hand, we plan to incorporate lightweight Transformer blocks and graph-based attention modules into the encoder–decoder architecture, which is expected to better capture hierarchical spatial relationships in complex building scenes. On the other hand, the aim is to deeply integrate the 2D segmentation results with 3D modeling pipelines, exploring 3D building reconstruction techniques based on multiview stereo vision or point cloud semantic segmentation. In addition, the inference speed of the network will be further optimized using model compression techniques, enabling lightweight deployment and facilitating the application of the proposed algorithm in real-time scenarios, such as dynamic monitoring in smart cities and rapid disaster assessment.

Author Contributions

Conceptualization, Guobiao Yao and Jingxue Bi; methodology, Guobiao Yao and Yan Chen; software, Wenxiao Sun; validation, Yan Chen and Zeyu Zhang; formal analysis, Yifei Tang; investigation, Guobiao Yao; resources, Jingxue Bi; data curation, Guobiao Yao; writing—original draft preparation, Guobiao Yao and Yan Chen; writing—review and editing, Guobiao Yao and Yan Chen; visualization, Guobiao Yao and Wenxiao Sun; supervision, Jingxue Bi; project administration, Guobiao Yao; funding acquisition, Guobiao Yao. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China with Project No. 42171435, the Shandong Provincial Natural Science Foundation with Project No. ZR2024QD012 and ZR2021MD006, the Postgraduate Education and Teaching Reform Foundation of Shandong Province with Project No. SDYJG19115, and the Undergraduate Education and Teaching Reform Foundation of Shandong Province with Project No. Z2021014. This work was also funded by the Youth Innovation Team Project of Higher School in Shandong Province with Project No. 2023KJ121.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, D.; Gao, X.; Yang, Y.; Guo, K.; Han, K.; Xu, L. Advances and future prospects in building extraction from high-resolution remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 6994–7016. [Google Scholar] [CrossRef]
  2. Huang, X.; Zhang, L. Photogrammetric engineering & remote sensing. Photogramm. Eng. Remote Sens. 2011, 77, 721–732. [Google Scholar]
  3. Turker, M.; Koc-San, D. Building extraction from high-resolution optical spaceborne images using the integration of support vector machine (SVM) classification, Hough transformation and perceptual grouping. Int. J. Appl. Earth Obs. Geoinf. 2015, 34, 58–69. [Google Scholar] [CrossRef]
  4. Zhang, Y.; Roffey, M.; Leblanc, S.G. A novel framework for rapid detection of damaged buildings using pre-event LiDAR data and shadow change information. Remote Sens. 2021, 13, 3297. [Google Scholar] [CrossRef]
  5. Jung, S.; Lee, K.; Lee, W.H. Object-based high-rise building detection using morphological building index and digital map. Remote Sens. 2022, 14, 330. [Google Scholar] [CrossRef]
  6. Wang, J.; Liu, B.; Xu, K. Semantic segmentation of high-resolution images. Sci. China Inf. Sci. 2017, 60, 123101. [Google Scholar] [CrossRef]
7. Shelhamer, E.; Long, J.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
8. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 5–9 October 2015; pp. 234–241.
9. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
10. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
11. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
12. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 833–848.
13. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 205–218.
14. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. arXiv 2021, arXiv:2105.15203.
15. Chen, K.; Zou, Z.; Shi, Z. Building extraction from remote sensing images with sparse token transformers. Remote Sens. 2021, 13, 4441.
16. Zhang, R.; Wan, Z.; Zhang, Q.; Zhang, G. DSAT-Net: Dual spatial attention transformer for building extraction from aerial images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6008405.
17. Xia, L.; Mi, S.; Zhang, J.; Luo, J.; Shen, Z.; Cheng, Y. Dual-stream feature extraction network based on CNN and transformer for building extraction. Remote Sens. 2023, 15, 2689.
18. Oktay, O.; Schlemper, J.; Le Folgoc, L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999.
19. Ye, Z.; Fu, Y.; Gan, M.; Deng, J.; Comber, A.; Wang, K. Building extraction from very high resolution aerial imagery using joint attention deep neural network. Remote Sens. 2019, 11, 2970.
20. Lin, H.; Hao, M.; Luo, W.; Yu, H.; Zheng, N. BEARNet: A novel buildings edge-aware refined network for building extraction from high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6005305.
21. Xue, H.; Liu, K.; Wang, Y.; Chen, Y.; Huang, C.; Wang, P.; Li, L. MAD-UNet: A multi-region UAV remote sensing network for rural building extraction. Sensors 2024, 24, 2393.
22. Guo, Z.; Bian, L.; Hu, W.; Li, J.; Ni, H.; Huang, X. DSNet: A novel way to use atrous convolutions in semantic segmentation. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 3679–3692.
23. Song, B.; Shao, W.; Shao, P.; Wang, J.; Xiong, J.; Qi, C. DHI-Net: A novel detail-preserving and hierarchical interaction network for building extraction. IEEE Geosci. Remote Sens. Lett. 2024, 21, 2504605.
24. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
25. Zheng, M.; Sun, L.; Dong, J.; Pan, J. SMFANet: A lightweight self-modulation feature aggregation network for efficient image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 321–338.
26. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
27. Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586.
28. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can semantic labeling methods generalize to any city? The INRIA aerial image labeling benchmark. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 3226–3229.
Figure 1. Overall architecture of MSA-UNet. The network is built upon a U-Net backbone, with an MSFA module inserted at the bottleneck to aggregate multiscale context. CBAM blocks in the decoder recalibrate feature responses, while ASCs at each scale enhance multiscale feature fusion for building extraction.
Figure 2. Architecture of the MSFA module, where the PPM branch aggregates contextual information at multiple pyramid scales and the SMFA branch adaptively reweights and fuses these features to enhance multiscale building representations.
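For readers who prefer code, the following is a minimal PyTorch sketch in the spirit of the MSFA design summarized in Figure 2, i.e., pyramid-pooling context aggregation followed by an adaptive reweighting of the fused features. The pooling scales, channel widths, class names (PPMBranch, MSFABlock), and the sigmoid-gated residual fusion are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a PPM + self-modulated fusion bottleneck.
# Pooling scales, channel widths, and the gating form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPMBranch(nn.Module):
    """Pyramid pooling: pool at several scales, project, upsample, concatenate."""
    def __init__(self, in_ch, out_ch, scales=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(s),
                          nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
                          nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for s in scales
        ])
        self.project = nn.Sequential(
            nn.Conv2d(in_ch + out_ch * len(scales), in_ch, kernel_size=3,
                      padding=1, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        pyramids = [F.interpolate(stage(x), size=(h, w), mode="bilinear",
                                  align_corners=False) for stage in self.stages]
        return self.project(torch.cat([x] + pyramids, dim=1))

class MSFABlock(nn.Module):
    """PPM context followed by a simple self-modulated reweighting of the result."""
    def __init__(self, channels):
        super().__init__()
        self.ppm = PPMBranch(channels, channels // 4)
        self.modulate = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid())                      # per-pixel, per-channel weights in [0, 1]

    def forward(self, x):
        ctx = self.ppm(x)
        return x + ctx * self.modulate(ctx)    # residual fusion of the reweighted context

# quick shape check
if __name__ == "__main__":
    feats = torch.randn(1, 512, 32, 32)
    print(MSFABlock(512)(feats).shape)         # torch.Size([1, 512, 32, 32])
```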
Figure 3. Comparative visualization of feature response heatmaps for two randomly selected images: (a) test images; (b) response maps of the original high-level features; (c) response maps after PPM enhancement; and (d) response maps after SMFA refinement.
Figure 4. Convolutional block attention module, where the channel attention branch selectively enhances semantically relevant feature channels and the spatial attention branch highlights critical spatial regions, jointly improving the quality of reconstructed feature maps.
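The dual-attention block in Figure 4 follows the published CBAM design [26]; a compact sketch is given below. The reduction ratio of 16 and the 7 × 7 spatial kernel are the common defaults from [26] and may differ from the exact configuration used in MSA-UNet.

```python
# Compact CBAM sketch following Woo et al. [26]; reduction ratio and the 7x7
# spatial kernel are the usual defaults and may differ from this paper's settings.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))   # global average descriptor
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))    # global max descriptor
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                   # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)                    # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, both applied multiplicatively."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)
```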
Figure 5. Attentive skip connection module. Convolutional layers are used to fuse encoder and decoder features, while a pixel-wise gating branch generates attention weights to suppress noisy responses and highlight building-related structures. ⊗ denotes element-wise multiplication.
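A minimal sketch of the gating idea in Figure 5 is shown next: encoder and decoder features are fused by convolutions, a sigmoid branch produces pixel-wise weights, and the encoder features are reweighted (⊗) before concatenation. The class name AttentiveSkip, the fusion depth, and the channel layout are assumptions made for illustration.

```python
# Illustrative attentive skip connection: conv fusion + pixel-wise sigmoid gate.
# Channel widths, layer depth, and the gating form are assumptions.
import torch
import torch.nn as nn

class AttentiveSkip(nn.Module):
    def __init__(self, enc_ch, dec_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(enc_ch + dec_ch, enc_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(enc_ch), nn.ReLU(inplace=True))
        self.gate = nn.Sequential(
            nn.Conv2d(enc_ch, 1, kernel_size=1),
            nn.Sigmoid())                                   # pixel-wise weights in [0, 1]

    def forward(self, enc_feat, dec_feat):
        fused = self.fuse(torch.cat([enc_feat, dec_feat], dim=1))
        weights = self.gate(fused)                          # (B, 1, H, W)
        return torch.cat([enc_feat * weights, dec_feat], dim=1)  # reweight, then concatenate

# usage: skip = AttentiveSkip(256, 256); out = skip(enc, dec) with matching spatial sizes
```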
Figure 6. Loss convergence curves: (a) WHU dataset; (b) Inria dataset. The numerical values denote the final training and validation losses on each dataset.
Figure 7. Qualitative ablation results on the WHU (top) and Inria (bottom) datasets: (a) original image; (b) ground-truth label; (c) Baseline U-Net; (d) Baseline + MSFA; (e) Baseline + MSFA + CBAM; and (f) Baseline + MSFA + CBAM + ASC (MSA-UNet). The red frames highlight specific regions for detailed comparison.
Figure 8. Sample building extraction results of various models on the WHU dataset: (a) original image; (b) ground-truth label; (c) U-Net; (d) PSPNet; (e) DeepLabV3+; (f) Attention U-Net; (g) STTNet; (h) DSATNet; and (i) MSA-UNet. Rows (v) and (vi) show the corresponding locally enlarged views of Experiments V and VI, respectively. The red frames highlight specific regions for detailed comparison.
Figure 9. Sample building extraction results of various models on the Inria dataset: (a) original image; (b) ground-truth label; (c) U-Net; (d) PSPNet; (e) DeepLabV3+; (f) Attention U-Net; (g) STTNet; (h) DSATNet; and (i) MSA-UNet. Rows (v) and (vi) show the corresponding locally enlarged views of Experiments V and VI, respectively. The red frames highlight specific regions for detailed comparison.
Figure 10. Sample building boundary extraction results of various models on the WHU dataset: (a) ground-truth label; (b) U-Net; (c) PSPNet; (d) DeepLabV3+; (e) Attention U-Net; (f) STTNet; (g) DSATNet; and (h) MSA-UNet. The red frames highlight specific regions for detailed comparison.
Figure 11. Sample building boundary extraction results of various models on the Inria dataset: (a) ground-truth label; (b) U-Net; (c) PSPNet; (d) DeepLabV3+; (e) Attention U-Net; (f) STTNet; (g) DSATNet; and (h) MSA-UNet. The red frames highlight specific regions for detailed comparison.
Table 1. Structural differences between the original U-Net and MSA-UNet.
Component                 | Original U-Net                                    | MSA-UNet
Encoder                   | Stacked simple convolutional blocks               | Pretrained VGG16 as backbone encoder
Multiscale Feature Fusion | Not included                                      | PPM (pyramid pooling module) + SMFA (self-modulation feature aggregation)
Decoder                   | Transposed convolution for upsampling             | Bilinear interpolation with dual-attention modules for dependency modeling
Skip Connections          | Direct concatenation of encoder–decoder features  | Attentive skip connections (ASC) for adaptive feature fusion
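To make the structural differences in Table 1 concrete, the skeleton below wires a pretrained VGG16 encoder, a bottleneck slot, bilinear-upsampling decoder blocks, and skip connections in the way Table 1 describes. MSAUNetSketch and DecoderBlock are illustrative names, the channel widths simply follow the VGG16 stages, and the nn.Identity placeholders mark where the MSFA, CBAM, and attentive-skip sketches shown with Figures 2, 4 and 5 would plug in; this is a sketch under those assumptions, not the authors' released code.

```python
# Schematic MSA-UNet-style skeleton corresponding to Table 1 (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class DecoderBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.attn = nn.Identity()      # placeholder for a CBAM block

    def forward(self, x, skip):
        # bilinear upsampling instead of transposed convolution (see Table 1)
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        x = self.conv(torch.cat([x, skip], dim=1))  # an attentive skip would reweight `skip` here
        return self.attn(x)

class MSAUNetSketch(nn.Module):
    def __init__(self, num_classes=1):
        super().__init__()
        feats = vgg16(weights="IMAGENET1K_V1").features   # downloads ImageNet weights
        self.enc1, self.enc2 = feats[:4], feats[4:9]      # 64, 128 channels
        self.enc3, self.enc4 = feats[9:16], feats[16:23]  # 256, 512 channels
        self.enc5 = feats[23:30]                          # 512 channels, 1/16 resolution
        self.bottleneck = nn.Identity()                   # placeholder for the MSFA block
        self.dec4 = DecoderBlock(512, 512, 512)
        self.dec3 = DecoderBlock(512, 256, 256)
        self.dec2 = DecoderBlock(256, 128, 128)
        self.dec1 = DecoderBlock(128, 64, 64)
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        e5 = self.bottleneck(self.enc5(e4))
        d = self.dec4(e5, e4)
        d = self.dec3(d, e3)
        d = self.dec2(d, e2)
        d = self.dec1(d, e1)
        return self.head(d)
```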
Table 2. Quantitative evaluation results of ablation experiments (the best values are in bold). “✓” indicates that the corresponding module is included in the model.
Dataset | MSFA | CBAM | ASC | mIoU/% | Acc/% | F1/%  | mPA/%
WHU     |      |      |     | 92.85  | 97.70 | 95.70 | 96.36
WHU     | ✓    |      |     | 93.56  | 97.95 | 96.12 | 96.45
WHU     | ✓    | ✓    |     | 94.00  | 98.10 | 96.31 | 96.66
WHU     | ✓    |      | ✓   | 93.67  | 97.99 | 96.30 | 96.54
WHU     | ✓    | ✓    | ✓   | 94.26  | 98.32 | 96.57 | 96.85
Inria   |      |      |     | 84.05  | 94.28 | 89.93 | 90.65
Inria   | ✓    |      |     | 85.39  | 94.75 | 91.31 | 91.99
Inria   | ✓    | ✓    |     | 85.77  | 95.04 | 91.40 | 92.14
Inria   | ✓    |      | ✓   | 85.21  | 94.66 | 91.00 | 92.05
Inria   | ✓    | ✓    | ✓   | 85.92  | 95.24 | 91.50 | 92.26
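For reference, the metrics reported in Tables 2, 4 and 5 can be computed from a binary building/background confusion matrix as sketched below. This is a generic NumPy helper using the standard definitions of mIoU, overall accuracy, F1, and mPA; the paper's exact aggregation (per image versus over the whole test set) is not stated here and is left to the caller.

```python
# Generic computation of mIoU, overall accuracy, F1, and mPA for binary masks
# (standard definitions; not the authors' evaluation script).
import numpy as np

def binary_segmentation_metrics(pred, gt):
    """pred, gt: boolean arrays of the same shape (True = building pixel)."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()

    iou_fg = tp / (tp + fp + fn + 1e-12)             # building IoU
    iou_bg = tn / (tn + fp + fn + 1e-12)             # background IoU
    miou = (iou_fg + iou_bg) / 2

    acc = (tp + tn) / (tp + tn + fp + fn + 1e-12)    # overall pixel accuracy
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)

    pa_fg = tp / (tp + fn + 1e-12)                   # per-class pixel accuracy
    pa_bg = tn / (tn + fp + 1e-12)
    mpa = (pa_fg + pa_bg) / 2
    return {"mIoU": miou, "Acc": acc, "F1": f1, "mPA": mpa}

# example: metrics = binary_segmentation_metrics(pred_mask > 0.5, gt_mask > 0.5)
```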
Table 3. Model complexity of different ablation settings.
Model                                    | Params/M | FLOPs/G
Baseline U-Net                           | 24.90    | 92.04
Baseline + MSFA                          | 27.79    | 97.43
Baseline + MSFA + CBAM                   | 27.92    | 98.50
Baseline + MSFA + CBAM + ASC (MSA-UNet)  | 28.44    | 102.05
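Parameter counts such as those in Table 3 can be reproduced directly in PyTorch; FLOP estimates additionally depend on the input resolution and the profiling tool. In the sketch below, count_params_m is a trivial helper, the 512 × 512 RGB input is an assumed profiling size, and fvcore is only one of several possible profilers.

```python
# Parameter counting is straightforward; FLOP profiling needs an external tool
# (the fvcore usage below is a sketch, assuming fvcore is installed).
import torch

def count_params_m(model):
    """Trainable parameters in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# FLOPs (strictly speaking MACs, depending on the tool) for an assumed 512x512 RGB input:
# from fvcore.nn import FlopCountAnalysis
# flops_g = FlopCountAnalysis(model, torch.randn(1, 3, 512, 512)).total() / 1e9
```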
Table 4. Quantitative evaluation results of different models on the WHU dataset (the best values are presented in bold).
Model            | mIoU/% | Accuracy/% | F1/%  | mPA/% | Params/M | Runtime/(min/epoch)
U-Net            | 92.85  | 97.70      | 95.70 | 96.36 | 24.90    | 3.42
PSPNet           | 89.12  | 96.46      | 93.42 | 93.54 | 49.07    | 4.78
DeepLabV3+       | 90.96  | 96.76      | 94.05 | 93.83 | 41.22    | 4.30
Attention U-Net  | 93.12  | 97.82      | 95.63 | 95.96 | 35.60    | 7.42
STTNet           | 93.31  | 97.90      | 95.85 | 96.37 | 18.74    | 3.16
DSATNet          | 93.92  | 98.06      | 96.40 | 96.78 | 48.50    | 7.02
MSA-UNet         | 94.26  | 98.32      | 96.57 | 96.95 | 28.44    | 6.83
Table 5. Quantitative evaluation results of different models on the Inria dataset (the best values are presented in bold).
Model            | mIoU/% | Accuracy/% | F1/%  | mPA/% | Params/M | Runtime/(min/epoch)
U-Net            | 84.05  | 94.28      | 89.93 | 90.65 | 24.90    | 6.35
PSPNet           | 83.10  | 93.93      | 89.20 | 89.78 | 49.07    | 8.98
DeepLabV3+       | 82.38  | 93.52      | 88.46 | 87.70 | 41.22    | 8.27
Attention U-Net  | 84.37  | 94.52      | 90.53 | 91.21 | 35.60    | 13.85
STTNet           | 84.86  | 94.69      | 90.76 | 91.64 | 18.74    | 5.94
DSATNet          | 85.51  | 95.02      | 91.24 | 91.98 | 48.50    | 13.20
MSA-UNet         | 85.92  | 95.24      | 91.50 | 92.26 | 28.44    | 12.93
Table 6. Comparison of quantitative performance for building boundary pixels (the best values are presented in bold).
Model            | WHU Dataset                                   | Inria Dataset
                 | mIoU/% | Acc/% | F1/% | mPA/% | HD/Pixel       | mIoU/% | Acc/% | F1/% | mPA/% | HD/Pixel
U-Net            | 39.5   | 98.2  | 63.6 | 50.7  | 100.3          | 26.7   | 95.0  | 47.5 | 40.1  | 93.7
PSPNet           | 27.8   | 96.3  | 52.3 | 35.5  | 112.2          | 16.7   | 94.4  | 36.3 | 25.2  | 111.8
DeepLabV3+       | 28.1   | 97.3  | 58.8 | 37.9  | 108.7          | 16.9   | 94.1  | 36.1 | 26.5  | 106.4
Attention U-Net  | 40.1   | 98.3  | 74.1 | 50.7  | 101.1          | 26.0   | 95.0  | 47.2 | 39.0  | 95.5
STTNet           | 39.8   | 98.3  | 77.0 | 48.9  | 90.6           | 22.7   | 94.9  | 42.9 | 33.7  | 95.2
DSATNet          | 42.6   | 98.4  | 79.3 | 51.6  | 84.5           | 28.5   | 95.5  | 51.0 | 39.8  | 86.3
MSA-UNet         | 42.9   | 98.5  | 78.6 | 52.7  | 86.2           | 29.1   | 95.2  | 51.1 | 42.8  | 85.1
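The boundary-level comparison in Table 6 restricts the pixel metrics to building boundaries and additionally reports a Hausdorff distance (HD). A minimal way to derive boundary masks and an HD value with SciPy is sketched below; boundary_mask and hausdorff_distance are illustrative helpers, and the 1-pixel boundary width and the symmetric HD formulation are assumptions rather than the paper's exact protocol.

```python
# Sketch of boundary-pixel extraction and Hausdorff distance for binary masks.
# The 1-pixel boundary width and the symmetric HD variant are assumptions.
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import directed_hausdorff

def boundary_mask(mask, width=1):
    """Boundary = mask minus its erosion (width controls boundary thickness)."""
    eroded = binary_erosion(mask, iterations=width)
    return np.logical_and(mask, ~eroded)

def hausdorff_distance(pred, gt):
    """Symmetric Hausdorff distance (in pixels) between two boundary point sets."""
    p = np.argwhere(boundary_mask(pred))
    g = np.argwhere(boundary_mask(gt))
    if len(p) == 0 or len(g) == 0:
        return np.inf
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])

# boundary mIoU/F1/mPA can then be computed on boundary_mask(pred) vs. boundary_mask(gt)
# with the same confusion-matrix formulas used for the region metrics above.
```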