A Boundary-Assisted Multi-Scale Transformer for Object-Level Building Extraction from Satellite Remote Sensing Imagery

Li, Suju; Wang, Haoran; Yao, Jing; Wu, Zhaoming; Chen, Zhengchao

doi:10.3390/electronics15061301


by Suju Li ¹, Haoran Wang ²,³, Jing Yao ², Zhaoming Wu ²,³ and Zhengchao Chen ²,*

¹ National Disaster Reduction Center of China, Beijing 100124, China
² State Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China
³ University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.

Electronics 2026, 15(6), 1301; https://doi.org/10.3390/electronics15061301

Submission received: 31 January 2026 / Revised: 7 March 2026 / Accepted: 11 March 2026 / Published: 20 March 2026


Abstract

Building extraction is a core task in the semantic segmentation of satellite remote sensing imagery. Conventional pixel-level segmentation methods often prioritize texture over geometric structure, resulting in suboptimal performance in complex scenes affected by illumination variations, shadows, and scale changes. In this article, we introduce an object-level building extraction approach, the Multi-Scale Region Learning Network (MRLNet), which incorporates superpixel segmentation to represent an image as a set of adjacent regions and thereby better captures the geometric structure of buildings. The proposed model consists of a cascade multi-scale fusion module (CMSFM) that progressively integrates contextual information across different receptive fields, along with a boundary-assisted loss function designed to enhance edge delineation and improve object-level accuracy. Experimental results on the WHU building dataset and the Massachusetts Buildings Dataset show that the proposed method outperforms other representative semantic segmentation approaches, namely FCN, UNet, DeepLab V3, and SETR. On the WHU dataset, MRLNet achieves the highest MIoU of 90.14% and the highest F1 score of 92.47%. On the Massachusetts Buildings Dataset, MRLNet attains the best MIoU of 83.14% and the highest F1 score of 90.46%. In addition, ablation experiments show that adding the CMSFM and the boundary-assisted loss function yields a clear performance improvement, demonstrating the effectiveness of these two enhancements. This research is expected to provide a promising tool for the accurate extraction of buildings from satellite remote sensing images, which is indispensable in urban planning, disaster assessment, and other fields.

Keywords: building extraction; remote sensing; semantic segmentation; transformer

1. Introduction

Buildings are fundamental components of urban environments, serving as the primary physical manifestations of human habitation, economic activity, and cultural identity [1,2,3]. Accurate extraction of buildings is critical for urban planning, population estimation, infrastructure monitoring, disaster assessment, and environmental modeling [4,5,6]. In particular, building footprint delineation can provide essential spatial information for analyzing land-use patterns, modeling urban heat islands, assessing energy consumption, and evaluating the resilience of urban systems [7,8]. Reliable building extraction enables timely updating of geospatial databases, supports precision in three-dimensional city modeling, and facilitates integration with socioeconomic and environmental datasets [2,9]. Given the complexity and heterogeneity of urban landscapes—where buildings vary in size, height, materials, and spectral characteristics—developing robust, automated extraction methods is of strategic importance for advancing smart city initiatives, optimizing resource allocation, and enhancing the scientific understanding of urbanization processes.

With the advancement of remote sensing techniques, satellite remote sensing imagery has been frequently employed in building extraction over the past few decades [4,5,9,10], and the task has attracted substantial research interest, leading to a wide range of extraction approaches. Conventional building extraction algorithms are primarily developed using hand-crafted features or auxiliary building-related information based on the spectral, textural, contextual, and geometric characteristics of buildings [11,12,13]. Classical approaches based on hand-crafted features include the building index, the pixel shape index, and other methods [14,15,16]. Representative approaches that rely on auxiliary data make use of terrain and surface models, digital elevation models, and airborne laser scanning data [14,15,16]. Land cover classification results can also be utilized to obtain building objects by approximating their locations and shapes [14]. These traditional approaches rely heavily upon prior knowledge and empirical experience encoded in physical and statistical models.

In addition, deep learning-based methods built on convolutional neural networks (CNNs) and transformers have become popular for building extraction [17,18]. Building extraction is regarded as a single-target semantic segmentation task, which is a fundamental task in the semantic segmentation of remote sensing imagery [5]. Conventional end-to-end segmentation networks, such as U-Net [19], SegNet [20], and DeepLab [21], can be utilized to extract buildings. However, these general CNN-based methods suffer from low accuracy and incomplete results when extracting buildings at multiple scales, due to their limited capacity for multi-scale feature perception [5]. For this reason, several CNN-based methods have been specifically proposed for building extraction [22,23,24,25]. In recent years, transformer-based networks have also been adopted to model long-range dependencies across global building structures [5,26,27,28,29,30]. Examples include the dual-path vision Transformer in Wang et al. [30] for global spatial context extraction, the spatial attention Transformer in Zhang et al. [31] integrating local and global attention, and the Fusion-Former [13], which combines self-attention with depth-wise convolution for multi-scale building extraction. Multi-scale strategies have also emerged, such as the hierarchical vision Transformer with shifted windows [32] and the multi-scale attention network in Chang et al. [33], which applies channel and spatial attention to capture building features at multiple resolutions. Furthermore, hybrid CNN–Transformer networks have been proposed for building extraction [26,34,35]. For instance, Chang et al. [35] describe an asymmetric network combining CNNs and Transformers to fuse fine detail with long-range dependencies. Similarly, Yuan et al. [34] designed a CNN–Transformer hybrid with a modified multi-head self-attention mechanism to improve multi-scale feature learning.

Among the aforementioned methods, boundary-aware approaches such as E-D-Net [23] and MBR-HRNet [4] typically introduce additional boundary detection branches or post-processing modules to refine building contours, which increases model complexity. Multi-scale Transformer methods [32,33] primarily perform multi-scale fusion at the feature map level through attention mechanisms operating on image patches. In contrast, the approach proposed in this study differs in two key aspects: (a) rather than relying on additional network branches for boundary refinement, we incorporate boundary constraints directly into the loss function, thereby enhancing edge delineation without introducing extra architectural complexity and (b) instead of performing multi-scale fusion solely at the feature map level, we model multi-scale geometric information at the superpixel-region level, capturing finer-grained spatial relationships between pixels and their surrounding regions. These two design choices, namely superpixel-inspired multi-scale geometric modeling and boundary-assisted loss optimization, constitute the primary contributions of this work.

In this article, we continue the development of deep learning-based building extraction algorithms. We introduce, for the first time, a boundary-assisted multi-scale transformer for object-level building extraction from satellite remote sensing imagery, named the Multi-Scale Region Learning Network (MRLNet). This paper is organized as follows. Section 2 describes the remote sensing datasets used in this research and gives a detailed explanation of the proposed building extraction algorithm. Section 3 presents the experimental results. Section 4 discusses the findings and limitations. Lastly, Section 5 summarizes the key findings of this work.

2. Data and Method

2.1. Data

The building extraction data used in this study are sourced from the WHU Building Dataset https://gpcv.whu.edu.cn/data/building_dataset.html (accessed on 18 June 2025), provided by the Photogrammetry and Computer Vision Research Group (GPCV) at Wuhan University [36]. This publicly available dataset covers Christchurch, New Zealand, with an area exceeding 450 km². The dataset comprises 8188 optical remote sensing images, each cropped to 512 × 512 pixels, along with their corresponding labeled images. Each image has a spatial resolution of 0.3 m, enabling detailed representation of building features. The WHU building dataset is split into 4736 images for training, 1036 images for validation, and 2416 images for testing, providing a comprehensive and standardized benchmark for the development and assessment of the proposed algorithm.

To further validate the generalization capability of the proposed method, we additionally evaluate on the Massachusetts Buildings Dataset [37] https://www.cs.toronto.edu/~vmnih/data/ (accessed on 20 February 2026). This dataset was collected from the Boston metropolitan area at a spatial resolution of 1 m per pixel. It contains 151 aerial images of 1500 × 1500 pixels, split into 137 for training, 4 for validation, and 10 for testing. Compared with the WHU dataset, the Massachusetts Buildings Dataset features a more diverse urban landscape with varying building densities and complex backgrounds, providing a complementary benchmark for cross-dataset evaluation.

2.2. Method

Inspired by the concept of superpixel segmentation, Zhang et al. [38] proposed a semantic segmentation method called RegProxy, which integrates superpixel segmentation. In their method, an image is represented as a collection of adjacent regions, where each region learns both semantic and geometric pixel information through a Transformer. However, RegProxy incorporates geometric cues at only a single scale, which limits its effectiveness for remote sensing imagery characterized by diverse spatial resolutions and object sizes.

To address this limitation, we extend RegProxy by introducing a multi-scale module that enhances feature extraction across varying scales of remote sensing images. In addition, we design a boundary-assisted optimization loss function that explicitly strengthens the model’s ability to delineate building edges during training. Bringing these components together, we propose a vision Transformer architecture tailored for building extraction, which is called MRLNet, equipped with multi-scale feature learning and edge-aware optimization.

In this section, we first present the overall architecture of MRLNet for object-level remote sensing image building extraction, followed by a detailed description of the multi-scale feature fusion module and the boundary-assisted loss function proposed in this study.

2.2.1. Overview of MRLNet Architecture

Figure 1 illustrates the overall architecture of MRLNet, designed for building extraction from satellite remote sensing imagery. The network begins with a standard vision Transformer backbone. The input image is first divided into patches. Multiple affinity heads capture multi-scale geometric relationships between pixels, while a parallel token branch converts the patches into tokens for the vision Transformer encoder. This encoder learns the semantic representation of each region through interactive feature modeling. The resulting region-level features are then passed to an MLP classifier to predict regional semantics. Finally, these semantic predictions are fused with the geometric pixel information to produce the final building extraction output.

2.2.2. Transformer as Region Encoder

The Transformer is a sequence modeling architecture built on the attention mechanism, designed to capture global features across an entire sequence. Its feature extraction and modeling approach aligns closely with the idea in this paper of representing an image as a series of patches. Therefore, we adopt the Transformer as the region encoder in our framework.

The input image is divided into equal-sized square patches, each transformed into a one-dimensional region feature through a linear projection. These features are augmented with positional embeddings and passed into a Transformer encoder. Across multiple Transformer layers, the region features interactively exchange information, enabling the learning of rich semantic representations and the formation of global contextual features. Finally, an MLP classifier assigns a semantic category to each region.

2.2.3. Multiscale Pixel Region Geometrics

To achieve accurate semantic segmentation results, it is essential to incorporate not only semantic information but also geometric information. Unlike natural images, remote sensing imagery covers a wide range of object scales—from small to large, from fine details to overall structures. A single remote sensing image may simultaneously contain buildings, roads, farmland, and other land-cover types, each exhibiting distinct scale characteristics. Multi-scale features are therefore crucial for capturing spatial structures and contextual information.

In remote sensing tasks, numerous studies have demonstrated the necessity and effectiveness of multi-scale features. To better address the specific challenges of building extraction, we incorporate multi-scale geometric information into each region, enabling the model to capture both fine-grained details and broader contextual patterns.

Inspired by the concept of superpixel segmentation, we propose a novel method for calculating geometric information. As illustrated in Figure 2, we first construct a grid of size $H \times W$, where $H \times W = N$, thereby dividing the image into N regions, each corresponding to a single grid unit. Each unit contains M geometric pixels. For any given geometric pixel, its spatial influence is distributed probabilistically across the unit it resides in and its eight neighboring units, for a total of nine units. For a pixel p, the probability of belonging to the i-th unit among these nine is denoted as $q_{pi}$, with the constraint that the probabilities over all nine units sum to one:

\sum_{i = 1}^{9} q_{pi} = 1

(1)

The probability $q_{pi}$ is determined by computing the normalized inverse distance from pixel p to the center of each of these nine units, followed by a softmax normalization to ensure that the probabilities sum to one. Pixels closer to a unit center receive higher probability values for that unit.
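The soft assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `unit_probabilities`, the `temperature` parameter, and the specific form chosen for the "normalized inverse distance" (distance measured in unit sizes and mapped into (0, 1]) are hypothetical, and units that fall outside the image border are not treated specially.

```python
import math


def unit_probabilities(px, py, unit_size, temperature=1.0):
    """Softly assign pixel (px, py) to its own grid unit and the 8 neighbours.

    Returns nine ((row, col), probability) pairs whose probabilities sum to 1.
    """
    col = int(px // unit_size)  # grid unit that contains the pixel
    row = int(py // unit_size)

    units, scores = [], []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            # Centre of this unit in pixel coordinates.
            cx = (col + dc + 0.5) * unit_size
            cy = (row + dr + 0.5) * unit_size
            dist = math.hypot(px - cx, py - cy)
            # Assumed form of the "normalized inverse distance":
            # distance measured in unit sizes, mapped into (0, 1].
            scores.append(1.0 / (1.0 + dist / unit_size) / temperature)
            units.append((row + dr, col + dc))

    # Softmax over the nine scores (shifted by the max for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [(u, e / total) for u, e in zip(units, exps)]


# A pixel sitting exactly on the centre of unit (1, 1) of a grid with 4-pixel units.
probs = unit_probabilities(px=6.0, py=6.0, unit_size=4)
best_unit, best_prob = max(probs, key=lambda item: item[1])
print(best_unit, round(sum(p for _, p in probs), 6))
```

With this choice of score the distribution is fairly flat; a smaller `temperature` would concentrate more probability on the nearest unit.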

The geometric information of each pixel is integrated with the semantic information of its corresponding region via a convolution-based affinity head. After positional encoding, the affinity head enables each region to acquire pixel-level feature information. The parameter M, representing the number of geometric pixels contained in each unit, is configurable within the affinity head. By varying M, we can aggregate geometric data for each region at different spatial scales, thereby facilitating multi-scale feature extraction. In our implementation, M is set to 4, 8, and 16 across three parallel branches, producing geometric pixel information at resolutions of 1/16, 1/4, and 1/1 relative to the original image.

To make better use of multi-scale feature information, we propose a Cascade Multi-Scale Fusion Module (CMSFM) that gradually fuses low-resolution features into high-resolution features, as presented in Figure 3. The 1/16 geometric pixel features first pass through a 3 × 3 convolution and are upsampled by a factor of 4. They are then concatenated with the 1/4 geometric pixel features and fed into a 3 × 3 convolution that fuses the two resolutions, yielding fused geometric pixel features at 1/4 resolution. After a further 3 × 3 convolution, these features are fused with the 1/1 geometric pixel features by the same process, producing the final fused geometric pixel features. Finally, the fused pixel features are merged with the region classification results, so that category labels are assigned to the pixels within each region, yielding the building extraction result.
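The cascade can be traced with a small pure-Python sketch of the data flow. It is a toy under stated assumptions rather than the module itself: feature maps are single-channel nested lists, the learned 3 × 3 kernels are replaced by one shared placeholder weight, upsampling is nearest-neighbour (the interpolation mode is not stated in the text), and the names `conv3x3`, `upsample`, and `cmsfm` are hypothetical.

```python
def conv3x3(channels, weight=1.0):
    """Zero-padded 3x3 convolution merging all input channels into one map.

    A real layer would use learned kernels; every tap shares one placeholder
    weight here so that only the fusion order is illustrated.
    """
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0.0] * w for _ in range(h)]
    norm = 9 * len(channels)
    for ch in channels:
        for r in range(h):
            for c in range(w):
                acc = 0.0
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < h and 0 <= cc < w:
                            acc += ch[rr][cc]
                out[r][c] += weight * acc / norm
    return out


def upsample(fmap, factor):
    """Nearest-neighbour upsampling by an integer factor (assumed mode)."""
    return [[v for v in row for _ in range(factor)]
            for row in fmap for _ in range(factor)]


def cmsfm(f16, f4, f1):
    """Cascade fusion of 1/16, 1/4 and 1/1 geometric pixel features."""
    x = conv3x3([f16])       # 3x3 conv on the coarsest features
    x = upsample(x, 4)       # 1/16 grid -> 1/4 grid
    x = conv3x3([x, f4])     # concatenate with the 1/4 features and fuse
    x = conv3x3([x])         # further 3x3 conv before the next stage
    x = upsample(x, 4)       # 1/4 grid -> 1/1 grid
    return conv3x3([x, f1])  # concatenate with the 1/1 features and fuse


# Toy inputs for a 32 x 32 image: 2 x 2, 8 x 8 and 32 x 32 maps.
f16 = [[1.0] * 2 for _ in range(2)]
f4 = [[1.0] * 8 for _ in range(8)]
f1 = [[1.0] * 32 for _ in range(32)]
fused = cmsfm(f16, f4, f1)
print(len(fused), len(fused[0]))
```

The point of the sketch is the ordering: coarse features are always refined and upsampled before they meet the next finer scale, so the output lives on the full-resolution grid.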

2.2.4. Boundary Assist Loss

Our proposed building extraction model incorporates the concept of superpixel segmentation, grouping pixels into blocks (regions) and assigning each a semantic label. Figure 4 presents the model’s prediction outputs at various stages of training when optimized solely with cross-entropy loss. In the early training phase, the model predominantly focuses on the central pixels within each region. As training progresses, attention gradually shifts toward the peripheral pixels, enabling more complete coverage of the aggregated region and improving boundary delineation.

Given these characteristics, we incorporate a boundary-aware loss function, $L_{Bdry}$ [39], which explicitly measures boundary information. This term is integrated into our boundary-assisted optimization loss to enhance the model’s capability to delineate object boundaries accurately. It is defined as [39]:

L_{Bdry} = \frac{1}{N} \sum_{i = 1}^{N} \left( 1 - \left| \frac{\partial y_{i}}{\partial n_{i}} \right| \right) t_{i}

(2)

where $y_{i}$ represents the predicted probability at pixel i, $t_{i}$ denotes the ground-truth label, N is the total number of pixels, and $n_{i}$ is the unit outward normal vector at the boundary near pixel i. The term $\nabla_{n} y_{i} = \frac{\partial y_{i}}{\partial n_{i}}$ represents the directional derivative of the predicted probability map along the normal direction, which measures the sharpness of the prediction transition at boundary regions. Consequently, $(1 - | \nabla_{n} y_{i} |)$ can measure whether pixel i is in the boundary area. A smaller $L_{Bdry}$ value indicates that the predicted boundary information more closely matches the ground-truth labels.

In practice, the normal-direction gradient is computed following the implementation in [39]. The gradient of the predicted segmentation map is first estimated using horizontal and vertical Sobel operators, yielding partial derivatives $\partial y / \partial x$ and $\partial y / \partial z$. The boundary normal direction is derived from the ground-truth label map using the same operators. The directional derivative along the normal is then obtained via the dot product of the predicted gradient vector and the unit normal vector. This computation is fully differentiable and introduces negligible overhead during training.
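The computation just described can be sketched in plain Python on small nested-list maps. This is a hedged illustration, not the implementation of [39]: the Sobel responses are divided by 4 so that a sharp 0-to-1 step has magnitude 1, image borders use replicate padding, the normal is set to zero wherever the label gradient vanishes, and the absolute value follows the $(1 - | \nabla_{n} y_{i} |)$ expression in the text. All of these choices, and the names `sobel` and `boundary_loss`, are assumptions.

```python
import math

SOBEL_H = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))   # horizontal derivative
SOBEL_V = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))   # vertical derivative


def sobel(img, kernel):
    """Apply a 3x3 Sobel kernel with replicate padding, scaled by 1/4."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    rr = min(max(r + i - 1, 0), h - 1)
                    cc = min(max(c + j - 1, 0), w - 1)
                    acc += kernel[i][j] * img[rr][cc]
            out[r][c] = acc / 4.0
    return out


def boundary_loss(pred, label):
    """Mean over all pixels of (1 - |directional derivative|) * label."""
    pred_h, pred_v = sobel(pred, SOBEL_H), sobel(pred, SOBEL_V)
    lab_h, lab_v = sobel(label, SOBEL_H), sobel(label, SOBEL_V)
    h, w = len(pred), len(pred[0])
    total = 0.0
    for r in range(h):
        for c in range(w):
            norm = math.hypot(lab_h[r][c], lab_v[r][c])
            if norm > 0:
                # Unit normal taken from the ground-truth label gradient.
                nh, nv = lab_h[r][c] / norm, lab_v[r][c] / norm
            else:
                nh = nv = 0.0  # no boundary nearby (assumed handling)
            # Dot product of the predicted gradient with the unit normal.
            deriv = pred_h[r][c] * nh + pred_v[r][c] * nv
            total += (1.0 - abs(deriv)) * label[r][c]
    return total / (h * w)


# 6 x 6 label whose right half is building; compare a sharp and a flat prediction.
label = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]
sharp = [row[:] for row in label]
flat = [[0.5] * 6 for _ in range(6)]
print(boundary_loss(sharp, label) < boundary_loss(flat, label))
```

In this toy case the sharp prediction scores lower than the flat one, which matches the statement that a smaller value indicates better agreement at the boundary.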

On this basis, the overall loss function proposed in this study is defined as

L_{BAOL} = α (L_{CE} + L_{Dice}) + (1 - α) (λ L_{Bdry}) .

(3)

Here, $α$ is a weight that gradually decreases as training proceeds. Let its initial value be $α_{0}$. At the i-th training epoch, the value of $α$ is:

α_{i} = α_{0} \cdot θ^{i}

(4)

where $θ$ is a constant between 0 and 1. A larger value of $θ$ results in a slower decay of $α$ over training epochs, whereas a smaller $θ$ accelerates this decay. At the early stages of training under $L_{BAOL}$, the network primarily focuses on learning the semantics and aggregation of central pixel blocks. As training progresses and the contribution of the $L_{Bdry}$ term increases, the network’s attention gradually shifts toward edge pixel learning, thereby improving boundary delineation and overall segmentation accuracy.

In our experiments, we set $α_{0} = 0.95$, $θ = 0.98$, and $λ = 1.0$. The initial value $α_{0} = 0.95$ ensures that the model focuses predominantly on semantic learning during the early training phase. The decay factor $θ = 0.98$ provides a gradual transition, allowing the boundary loss to increase its influence progressively. The weight $λ = 1.0$ balances the magnitude of $L_{Bdry}$ with the cross-entropy and Dice loss terms. During development, we observed that model performance remained stable across reasonable ranges of these hyperparameters (e.g., $α_{0} \in [0.90, 0.99]$, $θ \in [0.95, 0.99]$), suggesting that the proposed loss formulation is not overly sensitive to precise hyperparameter tuning.
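To make the schedule in Equation (4) concrete, the short sketch below evaluates α at a few epochs using the reported settings α₀ = 0.95 and θ = 0.98. The function name `alpha_at` and the chosen epochs are illustrative only; the total number of training epochs is not stated in the text.

```python
def alpha_at(epoch, alpha0=0.95, theta=0.98):
    """Weight of the CE + Dice term at a given epoch: alpha0 * theta**epoch."""
    return alpha0 * theta ** epoch


for epoch in (0, 10, 31, 32, 50):
    a = alpha_at(epoch)
    # (1 - alpha) is the weight given to the boundary term.
    print(f"epoch {epoch:2d}: alpha = {a:.3f}, 1 - alpha = {1 - a:.3f}")
```

Under these settings α falls to roughly 0.78 after 10 epochs and drops below 0.5 between epochs 31 and 32, after which the boundary term carries the larger weight.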

2.2.5. Verification Metrics

Two commonly used validation metrics, i.e., Mean Intersection over Union (MIoU) and F1-score, are used to measure the performance of our proposed model. The MIoU is generally computed on a per-class basis. Specifically, the IoU is first calculated for each class, and then the results are averaged across all classes. This yields a global evaluation metric that provides a more comprehensive assessment of the model’s overall performance. The F1 score takes into account both precision and recall, which are commonly used to evaluate model performance. They are written as

IoU = \frac{TP}{TP + FP + FN}

(5)

MIoU = \frac{1}{k} \sum_{i = 1}^{k} {IoU}_{i}

(6)

Precision = \frac{TP}{TP + FP}

(7)

Recall = \frac{TP}{TP + FN}

(8)

F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}

(9)

where TP (true positives) is the number of pixels that are positive in the ground truth and predicted as positive, FP (false positives) is the number of pixels that are negative in the ground truth but predicted as positive, FN (false negatives) is the number of pixels that are positive in the ground truth but predicted as negative, and k is the number of classes.
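Equations (5) to (9) translate directly into code. The sketch below uses made-up pixel counts purely for illustration, and it assumes k = 2 classes (building and background) for the MIoU average, which the text does not state explicitly; the function names are hypothetical.

```python
def binary_metrics(tp, fp, fn):
    """IoU, precision, recall and F1 for one class from pixel counts."""
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, precision, recall, f1


def mean_iou(per_class_counts):
    """Average the per-class IoU over all (tp, fp, fn) triples."""
    ious = [tp / (tp + fp + fn) for tp, fp, fn in per_class_counts]
    return sum(ious) / len(ious)


# Hypothetical counts: a building pixel missed is a background false positive.
building = (80, 10, 10)      # tp, fp, fn for the building class
background = (900, 10, 10)   # tp, fp, fn for the background class

iou, precision, recall, f1 = binary_metrics(*building)
miou = mean_iou([building, background])
print(f"IoU={iou:.4f} P={precision:.4f} R={recall:.4f} F1={f1:.4f} MIoU={miou:.4f}")
```

Note that with a dominant background class the MIoU is pulled upward by the background IoU, so it is generally higher than the building-only IoU.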

3. Experimental Results

In this research, we conducted comparison experiments and ablation experiments using the WHU building dataset and the Massachusetts Buildings Dataset. All experiments were performed on an NVIDIA RTX3090 GPU.

3.1. Comparative Experiment

To evaluate the effectiveness of the proposed model, we conducted experiments under identical experimental settings and training strategies. We compared our approach against three classic semantic segmentation models, including FCN [40], UNet [19], and DeepLab V3 [41], as well as a Transformer-based model, SETR [42]. To ensure that the difference in parameter counts across experiments remains within a reasonable range, the encoders of FCN, UNet, and DeepLab V3 are all implemented with ResNet-101, while the encoder of the SETR model adopts the same Transformer-based architecture as used in this paper.

All models were tested on the WHU building dataset. The prediction results of the five models on the test set are presented in Figure 5. As shown in Figure 5, our model produces fewer false positives and false negatives than the other methods, with more regular and accurate boundary delineations. In addition, we quantitatively evaluated these models using the MIoU and F1 score, with the results summarized in Table 1. Among these methods, FCN achieved the lowest performance, with an MIoU of 76.18% and an F1 score of 79.32%. The UNet, DeepLab V3, and SETR models achieved MIoU values of 83.27%, 86.41%, and 88.33%, with F1 scores of 86.52%, 88.39%, and 90.26%, respectively. Our proposed MRLNet outperformed all other models, achieving the highest MIoU of 90.14% and the highest F1 score of 92.47%, demonstrating its superior capability in accurate and consistent building extraction.

In large-scale building extraction tasks, the efficiency of the model is of critical importance. To comprehensively evaluate computational efficiency, we report the number of parameters, inference time per 512 × 512 crop, and frames per second (FPS) for all compared models, as summarized in Table 2. All inference time measurements were conducted on a single NVIDIA RTX3090 GPU with 50 warm-up iterations and averaged over 200 runs.

As shown in Table 2, MRLNet has 86.60 M parameters, which is fewer than SETR (98.00 M) but more than FCN (53.27 M), UNet (62.03 M), and DeepLab V3 (63.00 M). This is because MRLNet employs a Transformer-based backbone similar to SETR, which inherently requires more parameters than CNN-based architectures. However, in terms of inference speed, MRLNet achieves 78.6 FPS (12.7 ms per crop), which is significantly faster than SETR (36.8 FPS, 27.1 ms) and comparable to DeepLab V3 (76.3 FPS, 13.1 ms). Considering that MRLNet achieves the highest accuracy among all compared models while maintaining competitive inference speed, it offers the best balance between accuracy and efficiency for building extraction tasks.

3.2. Cross-Dataset Validation on Massachusetts Buildings Dataset

To further evaluate the generalization capability of the proposed method, we conducted additional experiments on the Massachusetts Buildings Dataset [37]. All models were trained and tested under the same experimental settings as described above. The quantitative results are summarized in Table 3. Results are reported as mean ± standard deviation over three independent runs.

As shown in Table 3, MRLNet achieves the best performance across all evaluation metrics, with an MIoU of 83.14% and an F1 score of 90.46%. Compared with the second-best model, UNet, MRLNet improves MIoU by 0.99 percentage points and F1 by 0.65 percentage points. The consistent superiority of MRLNet on this dataset, which has a coarser spatial resolution (1 m vs. 0.3 m) and more diverse urban landscapes than the WHU dataset, demonstrates the robustness and generalization ability of the proposed approach.

The visual comparison of prediction results on the Massachusetts Buildings Dataset is presented in Figure 6. MRLNet produces more complete building footprints with sharper boundaries compared to other methods, particularly for densely clustered buildings and buildings with complex backgrounds.

3.3. Ablation Experiment

To evaluate the effectiveness of the proposed multi-scale fusion strategy and edge optimization loss function, we conducted ablation experiments on the two improvements introduced in this study. The corresponding experimental results are summarized in Table 4.

The baseline model achieved an MIoU of 88.56% and an F1 score of 90.36%. The inclusion of the CMSFM module improved performance to an MIoU of 89.75% and an F1 score of 91.79%, whereas the application of the $L_{BAOL}$ loss function alone achieved an MIoU of 89.28% and an F1 score of 91.54%. The combination of both CMSFM and $L_{BAOL}$ achieved the highest performance, with an MIoU of 90.14% and an F1 score of 92.47%, demonstrating the complementary benefits of these two enhancements. Overall, compared with the baseline, incorporating the multi-scale fusion module and the edge optimization loss function improves MIoU by 1.58 percentage points and F1 by 2.11 percentage points, demonstrating the effectiveness of our proposed model.

To further validate the generalizability of these improvements, we also conducted ablation experiments on the Massachusetts Buildings Dataset. As shown in Table 5, the baseline model achieves an MIoU of 81.47% and an F1 score of 88.96%. The addition of the CMSFM module improves MIoU to 82.38%, while incorporating $L_{BAOL}$ alone yields an MIoU of 82.05%. The combination of both components achieves the best performance with an MIoU of 83.14% and an F1 score of 90.46%, consistent with the trends observed on the WHU dataset.

Qualitatively, we observe that the baseline model tends to produce fragmented predictions with irregular boundaries, particularly for densely clustered buildings. The addition of CMSFM notably improves the completeness of building footprints by leveraging multi-scale geometric features. The incorporation of $L_{BAOL}$ yields sharper and more regular boundary delineations. When both components are combined, the model achieves the best overall performance. These qualitative observations are further supported by the visual comparison presented in Figure 7.

4. Discussion

The experimental results on both the WHU building dataset (0.3 m resolution) and the Massachusetts Buildings Dataset (1 m resolution) consistently demonstrate the superiority of the proposed MRLNet over established CNN- and Transformer-based baselines. On the WHU dataset, MRLNet achieves an MIoU of 90.14% and an F1 score of 92.47%, while on the Massachusetts dataset, it attains an MIoU of 83.14% and an F1 score of 90.46%. The consistent performance gains across datasets with different spatial resolutions and urban characteristics confirm the generalization capability of the proposed approach.

From the perspective of model efficiency, MRLNet strikes a reasonable balance between accuracy and computational cost. With 86.60 M parameters and an inference speed of 78.6 FPS, it is significantly faster than the pure Transformer-based SETR (36.8 FPS) while achieving substantially higher accuracy. Although CNN-based models such as FCN and UNet exhibit faster inference speeds, their accuracy falls considerably behind that of MRLNet, indicating that the additional computational overhead of MRLNet is justified by its superior performance.

Compared with existing boundary-aware methods such as E-D-Net [23] and MBR-HRNet [4], which introduce additional boundary detection branches or post-processing modules, the proposed approach incorporates boundary constraints directly into the loss function. This design choice avoids increasing the model’s architectural complexity while still achieving effective boundary refinement. The ablation experiments confirm that the boundary-assisted loss ($L_{BAOL}$) and the cascade multi-scale fusion module (CMSFM) provide complementary benefits, with their combination yielding the best performance on both datasets.

Despite the promising results, this study has several limitations. First, the evaluation is limited to binary building extraction on two datasets, and the applicability of MRLNet to multi-class segmentation tasks remains to be investigated. Second, although we report that the model is not overly sensitive to hyperparameter choices within reasonable ranges, a systematic sensitivity analysis with quantitative results would provide stronger evidence. Third, while the proposed method achieves competitive inference speed, further optimization through lightweight backbone architectures could improve its suitability for real-time deployment scenarios.

5. Conclusions

In this study, the proposed MRLNet framework effectively integrates a Transformer-based region encoder, multi-scale pixel–region geometric modeling, and a cascade multi-scale fusion strategy to address the challenges of object-level building extraction in satellite remote sensing imagery. By combining semantic region features with fine-grained geometric information, the model captures both global context and local structural details, while the boundary-assisted loss function further enhances edge delineation accuracy.

Experimental results on both the WHU building dataset and the Massachusetts Buildings Dataset demonstrate that the MRLNet approach consistently outperforms established CNN- and Transformer-based baselines. On the WHU dataset, MRLNet achieved an MIoU of 90.14% and an F1 score of 92.47%, while on the Massachusetts Buildings Dataset, it attained an MIoU of 83.14% and an F1 score of 90.46%. These cross-dataset results confirm the robustness and generalization capability of the proposed approach, making it a promising solution for high-precision building extraction in large-scale remote sensing applications.

Several directions remain for future investigation. First, evaluation on additional datasets with higher diversity in building types and geographic regions would further validate generalizability. Second, exploring lightweight backbone architectures could improve inference efficiency for real-time applications. Third, extending the framework to multi-class semantic segmentation represents a natural progression. Finally, integrating building footprint extraction with height estimation for three-dimensional reconstruction is a promising direction.

Author Contributions

Conceptualization, S.L. and H.W.; methodology, S.L. and H.W.; software, S.L. and H.W.; validation, S.L., H.W., J.Y. and Z.W.; formal analysis, S.L. and H.W.; investigation, S.L. and Z.W.; resources, Z.W.; data curation, S.L.; writing: original draft preparation, S.L., H.W. and Z.W.; writing: review and editing, S.L., H.W., J.Y. and Z.W.; visualization, S.L.; supervision, Z.C.; project administration, Z.C.; funding acquisition, Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The WHU building data can be accessed at https://figshare.com/articles/dataset/WHU_building_dataset/25415098?file=45061306 (accessed on 18 June 2025). The Massachusetts Buildings Dataset can be accessed at https://www.cs.toronto.edu/~vmnih/data/ (accessed on 20 February 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

MRLNet	Multi-Scale Region Learning Network
CMSFM	Cascade Multi-Scale Fusion Module
CNN	Convolutional Neural Network
MIoU	Mean Intersection over Union
WHU	Wuhan University

References

1. Naheed, S.; Shooshtarian, S. The Role of Cultural Heritage in Promoting Urban Sustainability: A Brief Review. Land 2022, 11, 1508.
2. Huang, X.; Wen, D.; Li, J.; Qin, R. Multi-Level Monitoring of Subtle Urban Changes for the Megacities of China Using High-Resolution Multi-View Satellite Imagery. Remote Sens. Environ. 2017, 196, 56–75.
3. Wu, G.; Shao, X.; Guo, Z.; Chen, Q.; Yuan, W.; Shi, X.; Xu, Y.; Shibasaki, R. Automatic Building Segmentation of Aerial Imagery Using Multi-Constraint Fully Convolutional Networks. Remote Sens. 2018, 10, 407.
4. Yan, G.; Jing, H.; Li, H.; Guo, H.; He, S. Enhancing Building Segmentation in Remote Sensing Images: Advanced Multi-Scale Boundary Refinement with MBR-HRNet. Remote Sens. 2023, 15, 3766.
5. Yang, F.; Jiang, F.; Li, J.; Lu, L. MSTrans: Multi-Scale Transformer for Building Extraction from HR Remote Sensing Images. Electronics 2024, 13, 4610.
6. Guo, Z.; Shao, X.; Xu, Y.; Miyazaki, H.; Ohira, W.; Shibasaki, R. Identification of Village Building via Google Earth Images and Supervised Machine Learning Methods. Remote Sens. 2016, 8, 271.
7. Yin, C.; Yan, J.; Yuan, M.; Tian, G.; Wen, Q.; Wang, L.; Li, L. How Does Built Environment Affect the Urban Heat Island Effect? A Systematic Framework Integrating Land Use, Building Form, and Road Network. Environ. Dev. Sustain. 2025.
8. Wang, S.; Wang, Z.; Zhang, Y.; Fan, Y. Characteristics of Urban Heat Island in China and Its Influences on Building Energy Consumption. Appl. Sci. 2022, 12, 7678.
9. Xu, Y.; Wu, L.; Xie, Z.; Chen, Z. Building Extraction in Very High Resolution Remote Sensing Imagery Using Deep Learning and Guided Filters. Remote Sens. 2018, 10, 144.
10. Hu, Q.; Zhen, L.; Mao, Y.; Zhou, X.; Zhou, G. Automated Building Extraction Using Satellite Remote Sensing Imagery. Autom. Constr. 2021, 123, 103509.
11. Weidner, U.; Förstner, W. Towards Automatic Building Extraction from High-Resolution Digital Elevation Models. ISPRS J. Photogramm. Remote Sens. 1995, 50, 38–49.
12. Jin, X.; Davis, C.H. Automated Building Extraction from High-Resolution Satellite Imagery in Urban Areas Using Structural, Contextual, and Spectral Information. EURASIP J. Adv. Signal Process. 2005, 2005, 1–11.
13. Fan, Z.; Wang, S.; Pu, X.; Wei, H.; Liu, Y.; Sui, X.; Chen, Q. Fusion-Former: Fusion Features across Transformer and Convolution for Building Change Detection. Electronics 2023, 12, 4823.
14. Lee, D.S.; Shan, J.; Bethel, J.S. Class-Guided Building Extraction from Ikonos Imagery. Photogramm. Eng. Remote Sens. 2003, 69, 143–150.
15. Bi, Q.; Qin, K.; Zhang, H.; Zhang, Y.; Li, Z.; Xu, K. A Multi-Scale Filtering Building Index for Building Extraction in Very High-Resolution Satellite Imagery. Remote Sens. 2019, 11, 482.
16. Huang, X.; Zhang, L. Morphological Building/Shadow Index for Building Extraction From High-Resolution Imagery Over Urban Areas. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 161–172.
17. Dong, X.; Cao, J.; Zhao, W. A Review of Research on Remote Sensing Images Shadow Detection and Application to Building Extraction. Eur. J. Remote Sens. 2024, 57, 2293163.
18. Li, Q.; Mou, L.; Sun, Y.; Hua, Y.; Shi, Y.; Zhu, X.X. A Review of Building Extraction From Remote Sensing Imagery: Geometrical Structures and Semantic Attributes. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4702315.
19. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241.
20. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
21. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
22. Wu, T.; Hu, Y.; Peng, L.; Chen, R. Improved Anchor-Free Instance Segmentation for Building Extraction from High-Resolution Remote Sensing Images. Remote Sens. 2020, 12, 2910.
23. Zhu, Y.; Liang, Z.; Yan, J.; Chen, G.; Wang, X. E-D-Net: Automatic Building Extraction From High-Resolution Aerial Images With Boundary Information. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4595–4606.
24. Tian, Q.; Zhao, Y.; Li, Y.; Chen, J.; Chen, X.; Qin, K. Multiscale Building Extraction with Refined Attention Pyramid Networks. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8011305.
25. Ji, S.; Wei, S.; Lu, M. A Scale Robust Convolutional Neural Network for Automatic Building Extraction from Aerial and Satellite Imagery. Int. J. Remote Sens. 2019, 40, 3308–3322.
26. Xia, L.; Mi, S.; Zhang, J.; Luo, J.; Shen, Z.; Cheng, Y. Dual-Stream Feature Extraction Network Based on CNN and Transformer for Building Extraction. Remote Sens. 2023, 15, 2689.
27. Hu, A.; Wu, L.; Xu, Y.; Xie, Z. SANET: A Shape-Aware Building Footprints Extraction Method in Remote Sensing Images by Integrating Fourier Shape Descriptors. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5632215.
28. Gibril, M.B.A.; Al-Ruzouq, R.; Shanableh, A.; Jena, R.; Bolcek, J.; Shafri, H.Z.M.; Ghorbanzadeh, O. Transformer-Based Semantic Segmentation for Large-Scale Building Footprint Extraction from Very-High Resolution Satellite Images. Adv. Space Res. 2024, 73, 4937–4954.
29. Yiming, T.; Tang, X.; Shang, H. A Shape-Aware Enhancement Vision Transformer for Building Extraction from Remote Sensing Imagery. Int. J. Remote Sens. 2024, 45, 1250–1276.
30. Wang, L.; Fang, S.; Meng, X.; Li, R. Building Extraction with Vision Transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625711.
31. Zhang, R.; Wan, Z.; Zhang, Q.; Zhang, G. DSAT-Net: Dual Spatial Attention Transformer for Building Extraction From Aerial Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6008405.
32. Chen, X.; Qiu, C.; Guo, W.; Yu, A.; Tong, X.; Schmitt, M. Multiscale Feature Learning by Transformer for Building Extraction From Satellite Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 2503605.
33. Chang, J.; He, X.; Li, P.; Tian, T.; Cheng, X.; Qiao, M.; Zhou, T.; Zhang, B.; Chang, Z.; Fan, T. Multi-Scale Attention Network for Building Extraction from High-Resolution Remote Sensing Images. Sensors 2024, 24, 1010.
34. Yuan, Q.; Xia, B. Cross-Level and Multiscale CNN-Transformer Network for Automatic Building Extraction from Remote Sensing Imagery. Int. J. Remote Sens. 2024, 45, 2893–2914.
35. Chang, J.; Cen, Y.; Cen, G. Asymmetric Network Combining CNN and Transformer for Building Extraction from Remote Sensing Images. Sensors 2024, 24, 6198.
36. Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586.
37. Mnih, V. Machine Learning for Aerial Image Labeling. Ph.D. Thesis, University of Toronto, Toronto, ON, Canada, 2013.
38. Zhang, Y.; Pang, B.; Lu, C. Semantic Segmentation by Early Region Proxy. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1248–1258.
39. Wang, C.; Zhang, Y.; Cui, M.; Ren, P.; Yang, Y.; Xie, X.; Hua, X.S.; Bao, H.; Xu, W. Active Boundary Loss for Semantic Segmentation. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2397–2405.
40. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
41. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587.
42. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6877–6886.

Figure 1. The general structure of MRLNet for object-level remote sensing image building extraction.

Figure 2. Schematic illustration of geometry within regions.

Figure 3. The schematic of CMSFM.

Figure 4. Visualization of prediction results at different stages of the model training process. Black and white denote true negatives and true positives, respectively.

Figure 5. Comparison of FCN, UNet, DeepLab V3, SETR, and MRLNet on the WHU building dataset. Black, white, green, and red denote true negatives, true positives, false negatives, and false positives, respectively.

Figure 6. Comparison of FCN, UNet, DeepLab V3, SETR, and MRLNet on the Massachusetts Buildings Dataset. Black, white, green, and red denote true negatives, true positives, false negatives, and false positives, respectively.

Figure 7. Ablation study visualization on the Massachusetts Buildings Dataset. From left to right: input image, ground truth, Baseline, Baseline + CMSFM, Baseline + $L_{BAOL}$, and Baseline + CMSFM + $L_{BAOL}$. Black, white, green, and red denote true negatives, true positives, false negatives, and false positives, respectively.

Table 1. Comparison of FCN, UNet, DeepLab V3, SETR, and MRLNet on the WHU building dataset. Results are reported as mean ± standard deviation over three independent runs. The best results are highlighted in bold.

Model	MIoU	F1
FCN	76.18 ± 0.35%	79.32 ± 0.31%
UNet	83.27 ± 0.28%	86.52 ± 0.24%
DeepLab V3	86.41 ± 0.22%	88.39 ± 0.19%
SETR	88.33 ± 0.26%	90.26 ± 0.21%
MRLNet (Ours)	90.14 ± 0.18%	92.47 ± 0.15%
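As a reminder of how the two reported metrics relate to pixel counts, the sketch below computes MIoU and F1 for a binary building mask. It assumes the common convention that MIoU averages the IoU of the background and building classes and that F1 is computed for the building class; the exact definitions used for Table 1 are those given in Section 2.2.5 and may differ. The function names are illustrative only.

```python
def confusion_counts(pred, gt):
    """Count TP, FP, FN, TN for flat sequences of 0/1 labels (1 = building)."""
    tp = fp = fn = tn = 0
    for p, g in zip(pred, gt):
        if p == 1 and g == 1:
            tp += 1
        elif p == 1 and g == 0:
            fp += 1
        elif p == 0 and g == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def miou_and_f1(pred, gt):
    """Mean IoU over {background, building} and F1 of the building class."""
    tp, fp, fn, tn = confusion_counts(pred, gt)
    iou_building = safe_div(tp, tp + fp + fn)
    iou_background = safe_div(tn, tn + fp + fn)
    f1 = safe_div(2 * tp, 2 * tp + fp + fn)
    return (iou_building + iou_background) / 2.0, f1


# Toy example with eight pixels: TP = 2, FP = 1, FN = 1, TN = 4.
pred = [1, 1, 0, 0, 1, 0, 0, 0]
gt = [1, 1, 1, 0, 0, 0, 0, 0]
miou, f1 = miou_and_f1(pred, gt)
print(f"MIoU = {miou:.4f}, F1 = {f1:.4f}")
```

Under this convention the background class usually dominates the image, so MIoU can be noticeably higher than the building-class IoU alone; this is worth keeping in mind when comparing against papers that report building IoU only.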

Table 2. Comparison of model complexity and inference efficiency across FCN, UNet, DeepLab V3, SETR, and MRLNet. Inference time and FPS are measured on a single NVIDIA RTX 3090 GPU with an input size of 512 × 512.

Model	Params (M)	Time (ms)	FPS
FCN	53.27	10.1	98.6
UNet	62.03	9.9	100.9
DeepLab V3	63.00	13.1	76.3
SETR	98.00	27.1	36.8
MRLNet (Ours)	86.60	12.7	78.6
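The Time and FPS columns of Table 2 can be cross-checked against each other. The sketch below assumes that FPS is the reciprocal of the mean per-image latency (that is, a batch size of 1), which is an assumption about the measurement protocol rather than something stated in the table caption; the helper names are illustrative.

```python
# Rows of Table 2: model -> (latency in ms, reported FPS)
TABLE_2 = {
    "FCN": (10.1, 98.6),
    "UNet": (9.9, 100.9),
    "DeepLab V3": (13.1, 76.3),
    "SETR": (27.1, 36.8),
    "MRLNet": (12.7, 78.6),
}


def fps_from_latency(ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / ms


def speedup(model, reference, table=TABLE_2):
    """How many times faster `model` is than `reference`, by latency."""
    return table[reference][0] / table[model][0]


for name, (ms, reported) in TABLE_2.items():
    derived = fps_from_latency(ms)
    print(f"{name:12s} derived {derived:6.1f} FPS, reported {reported:6.1f} FPS")

print(f"MRLNet vs SETR speedup: {speedup('MRLNet', 'SETR'):.2f}x")
```

The derived and reported FPS values agree to within 0.5 FPS for every row, which is consistent with latencies rounded to 0.1 ms. By latency, MRLNet is roughly 2.1 times faster than SETR and about 28% slower than UNet.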

Table 3. Comparison of FCN, UNet, DeepLab V3, SETR, and MRLNet on the Massachusetts Buildings Dataset. Results are reported as mean ± standard deviation over three independent runs. The best results are highlighted in bold.

Model	MIoU	F1
FCN	81.20 ± 0.32%	89.20 ± 0.27%
UNet	82.15 ± 0.27%	89.81 ± 0.23%
DeepLab V3	81.26 ± 0.30%	89.24 ± 0.26%
SETR	80.93 ± 0.34%	89.01 ± 0.29%
MRLNet (Ours)	83.14 ± 0.21%	90.46 ± 0.17%

Table 4. Ablation study results on the WHU building dataset. The best results are highlighted in bold.

Model	MIoU	F1
Baseline	88.56%	90.36%
Baseline + CMSFM	89.75%	91.79%
Baseline + $L_{BAOL}$	89.28%	91.54%
Baseline + CMSFM + $L_{BAOL}$	90.14%	92.47%

Table 5. Ablation study results on the Massachusetts Buildings Dataset. The best results are highlighted in bold.

Model	MIoU	F1
Baseline	81.47%	88.96%
Baseline + CMSFM	82.38%	89.72%
Baseline + $L_{BAOL}$	82.05%	89.48%
Baseline + CMSFM + $L_{BAOL}$	83.14%	90.46%
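To read the ablation results in Tables 4 and 5 as gains over the baseline, the following short sketch subtracts the baseline row from each variant. The numbers are copied from the two tables; the dictionary layout and function name are illustrative assumptions.

```python
# (MIoU %, F1 %) per configuration, copied from Tables 4 and 5.
ABLATION = {
    "WHU": {
        "Baseline": (88.56, 90.36),
        "Baseline + CMSFM": (89.75, 91.79),
        "Baseline + L_BAOL": (89.28, 91.54),
        "Baseline + CMSFM + L_BAOL": (90.14, 92.47),
    },
    "Massachusetts": {
        "Baseline": (81.47, 88.96),
        "Baseline + CMSFM": (82.38, 89.72),
        "Baseline + L_BAOL": (82.05, 89.48),
        "Baseline + CMSFM + L_BAOL": (83.14, 90.46),
    },
}


def gains_over_baseline(rows):
    """Return {variant: (MIoU gain, F1 gain)} in percentage points."""
    base_miou, base_f1 = rows["Baseline"]
    return {
        name: (round(miou - base_miou, 2), round(f1 - base_f1, 2))
        for name, (miou, f1) in rows.items()
        if name != "Baseline"
    }


for dataset, rows in ABLATION.items():
    print(dataset)
    for name, (d_miou, d_f1) in gains_over_baseline(rows).items():
        print(f"  {name:28s} MIoU +{d_miou:.2f}  F1 +{d_f1:.2f}")
```

In MIoU, CMSFM alone adds 1.19 points on WHU and 0.91 on Massachusetts, the boundary-assisted loss alone adds 0.72 and 0.58, and the combination adds 1.58 and 1.67. The combined gain is smaller than the sum of the individual gains on WHU (1.58 vs. 1.91) and larger on Massachusetts (1.67 vs. 1.49). Since the ablation tables report no standard deviations, and the run-to-run deviations in Tables 1 and 3 are around 0.2 to 0.3 points, differences of this size should be interpreted with some caution.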

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
