Multi-Scale Guided Context-Aware Transformer for Remote Sensing Building Extraction
Abstract
1. Introduction
- We propose a hierarchical receptive-field expansion mechanism based on asymmetric convolutions and progressively dilated convolutions; it integrates dynamic multi-scale feature fusion with residual-guided optimization to strengthen contextual modeling in dense prediction tasks;
- We propose a dynamic multi-scale feature fusion method that uses hierarchical attention for adaptive cross-scale aggregation, preserving local geometry while balancing global and local contexts. Integrated into a Transformer-based decoder, it employs deformable convolutions for spatially adaptive feature alignment;
2. Related Work
3. Methodology
3.1. Overview
3.2. Contextual Exploration Module
- The base path employs a 1 × 1 pointwise convolution for channel transformation.
- The primary expansion path adopts a four-stage cascade: a 1 × 1 channel-compression convolution, followed by a 1 × 3 and 3 × 1 asymmetric convolution pair, and finally a 3 × 3 dilated convolution with rate 3. This yields an output feature map with an effective 7 × 7 receptive field.
- The intermediate expansion path extends the primary path by using a 1 × 5 and 5 × 1 asymmetric convolution pair combined with a rate-5 3 × 3 dilated convolution, producing an output feature map with a 19 × 19 receptive field.
- The advanced expansion path further combines a 1 × 7 and 7 × 1 asymmetric convolution pair with a rate-7 3 × 3 dilated convolution, expanding the receptive field of the output feature map to 31 × 31 pixels, specifically targeting the global context of large-scale targets.
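The parallel paths described above can be sketched as a small PyTorch module. This is a hedged illustration of the cascade structure only: the class name, channel widths, and the additive fusion of the paths are assumptions for readability, not the authors' released implementation.

```python
# Illustrative sketch of the Contextual Exploration Module's parallel paths.
# Names, channel widths, and the additive fusion strategy are assumptions.
import torch
import torch.nn as nn

class ContextualExplorationModule(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, mid_ch: int = 32):
        super().__init__()
        # Base path: 1x1 pointwise convolution for channel transformation.
        self.base = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        # Expansion paths: 1x1 compression, a (1xk, kx1) asymmetric pair,
        # then a 3x3 dilated convolution with the stated rate.
        self.paths = nn.ModuleList(
            [self._make_path(in_ch, mid_ch, out_ch, k, rate)
             for k, rate in [(3, 3), (5, 5), (7, 7)]]
        )

    @staticmethod
    def _make_path(in_ch, mid_ch, out_ch, k, rate):
        return nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1),  # channel compression
            nn.Conv2d(mid_ch, mid_ch, (1, k), padding=(0, k // 2)),
            nn.Conv2d(mid_ch, mid_ch, (k, 1), padding=(k // 2, 0)),
            nn.Conv2d(mid_ch, out_ch, 3, padding=rate, dilation=rate),
        )

    def forward(self, x):
        # Fuse the base path with all expansion paths (summation assumed).
        out = self.base(x)
        for path in self.paths:
            out = out + path(x)
        return out
```

All convolutions use "same" padding, so the module preserves spatial resolution while each path contributes a progressively larger receptive field.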
3.3. Window-Guided Multi-Scale Attention Mechanism
3.4. Cross-Level Transformer Decoder
4. Experiment
4.1. Dataset
- The Massachusetts Buildings Dataset [57] encompasses 151 aerial images of the Boston region. Each image measures 1500 × 1500 pixels, covering 2.25 km2 at 1-m spatial resolution, for a total coverage of approximately 340 km2. The original partitioning designates 137 images for training, 10 for testing, and 4 for validation. During training, images and their corresponding labels were randomly cropped into fixed-size patches, while validation and testing images were padded so that their dimensions were divisible by 32. Padded regions were omitted from the evaluation metrics to preserve assessment accuracy (as illustrated in Figure 5a).
- The WHU Building Dataset [56] incorporates both aerial and satellite imagery subsets. This investigation focuses on the aerial subset acquired over Christchurch, New Zealand. Spanning 450 km2, the subset contains annotations for over 220,000 distinct buildings extracted from source imagery with 0.075-m spatial resolution. The processed dataset comprises 8189 image tiles of 512 × 512 pixels at 0.3-m resolution, allocated as follows: 4736 for training, 1036 for validation, and 2416 for testing (as illustrated in Figure 5b).
- The Inria Building Dataset [58] incorporates 360 orthorectified color aerial images at 0.3-m spatial resolution, encompassing five representative urban zones in the United States (Austin, Chicago, Kitsap County) and Europe (Western Tyrol, Vienna), with an aggregate coverage of 810 km2 (split equally, 405 km2 each, between the training and test sets). Following the official partitioning protocol, this investigation employed stratified sampling, randomly designating one to five images per city for validation and allocating the remaining images for training. Preprocessing began by zero-padding the original 5000 × 5000 pixel images, which were subsequently segmented into standardized fixed-size patches. After rigorous quality control eliminated patches containing no buildings, the refined dataset contained 9737 training samples and 1942 validation samples (as illustrated in Figure 5c).
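The pad-to-divisibility step mentioned for the validation and test tiles can be sketched as a small helper: each side is rounded up to the next multiple of 32 so that the encoder's downsampling stages divide evenly. The function name is illustrative, not from the paper.

```python
# Minimal sketch of padding image dimensions up to the next multiple of 32,
# as described for validation/test tiles. Helper name is an assumption.
def pad_to_multiple(height: int, width: int, divisor: int = 32) -> tuple[int, int]:
    """Return the (height, width) rounded up to the nearest multiple of `divisor`."""
    pad_h = (divisor - height % divisor) % divisor
    pad_w = (divisor - width % divisor) % divisor
    return height + pad_h, width + pad_w

# A 1500 x 1500 Massachusetts tile pads to 1504 x 1504 (47 * 32 = 1504);
# padded pixels are excluded when computing the evaluation metrics.
```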
4.2. Evaluation Metrics
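The four metrics reported in Tables 1-3 (IoU, F1-score, Precision, Recall) all derive from pixel-level confusion counts. A minimal sketch, assuming the standard definitions over true positives (tp), false positives (fp), and false negatives (fn):

```python
# Standard pixel-level segmentation metrics from confusion counts.
def segmentation_metrics(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp)          # fraction of predicted building pixels that are correct
    recall = tp / (tp + fn)             # fraction of true building pixels that are recovered
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    iou = tp / (tp + fp + fn)           # intersection over union of prediction and ground truth
    return {"IoU": iou, "F1": f1, "Precision": precision, "Recall": recall}
```

For example, tp = 8, fp = 2, fn = 2 gives Precision = Recall = F1 = 0.8 but IoU = 8/12 ≈ 0.667, which is why IoU is the stricter of the reported metrics.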
4.3. Experimental Settings
4.4. Compared Methods
4.5. Evaluation on Massachusetts Building Dataset
- Quantitative Comparison: As shown in Table 1, MSGCANet demonstrates excellent building extraction performance on the Massachusetts dataset. MSGCANet surpasses all compared methods in IoU, F1-score, and Precision, achieving 75.47%, 86.03%, and 87.55%, respectively, while its Recall reaches a competitive 84.50%. Taken together, these results show a clear overall advantage and reflect the effectiveness of the multi-scale collaborative mechanism in building extraction tasks.
- Visual Comparison: Figure 6 presents three sets of visual comparisons for building extraction. As shown, in the first row, other methods miss buildings within the red-boxed regions and produce incomplete building contours, whereas MSGCANet accurately and completely extracts the building outlines, closely matching the ground truth. In the second row, the T-shaped building in the red box is incompletely captured by other methods, while MSGCANet produces contours nearly identical to the ground truth. In the third row, for the elongated buildings within the red-boxed area, BOMSC-Net and CLGFF-Net exhibit over-detection errors, and BuildFormer and DFF-Net miss certain buildings. Only MSGCANet successfully extracts the building group accurately and completely, with minimal deviation from the ground truth. These results demonstrate MSGCANet’s superiority in preserving structural details and maintaining complete building contours.
4.6. Evaluation on WHU Building Dataset
- Quantitative Comparison: As shown in Table 2, MSGCANet demonstrates outstanding building extraction performance on the WHU dataset. In several key evaluation metrics, MSGCANet outperforms all compared methods, achieving an IoU of 91.53%, an F1-score of 95.59%, and a Precision of 95.65%. Additionally, with a Recall of 95.46%, MSGCANet shows a clear overall advantage, highlighting the effectiveness of the multi-scale collaborative mechanism in building extraction tasks.
- Visual Comparison: To more intuitively demonstrate the advantages of MSGCANet, Figure 7 presents the building extraction results of various comparative methods. In the first image, the building region in the lower-left corner, highlighted by a red box, suffers from missed detections in all other methods, with incomplete or entirely undetected contours. MSGCANet, however, successfully extracts complete and accurate building contours, showing high consistency with the ground truth. In the second image, the three small buildings highlighted by a red box are incompletely detected by other methods, while MSGCANet accurately captures both the number and contours of the buildings, detecting all three completely. In the third image, the dense building cluster in the lower-right corner is partially missed by other methods, resulting in incomplete extraction, whereas MSGCANet achieves complete extraction with results closely aligned with the ground truth, demonstrating very high accuracy.
4.7. Evaluation on Inria Building Dataset
- Quantitative Comparison: Table 3 presents a performance comparison of various methods on the Inria dataset. MSGCANet outperforms all other methods across all key metrics, achieving an IoU of 83.10%, an F1-score of 90.78%, a Precision of 91.98%, and a Recall of 89.55%. This demonstrates its clear overall advantage and highlights the effectiveness of the multi-scale collaborative mechanism in building extraction tasks.
- Visual Comparison: Figure 8 presents three representative cases. In the first image, the building highlighted by the red box is incompletely detected by all other methods, whereas MSGCANet successfully extracts the full building contour with high accuracy, closely matching the ground truth. In the second image, the buildings within the red box show noticeable contour deviations and shape distortions in other methods, while MSGCANet achieves the most accurate and faithful representation of the building shapes. In the third image, the small building in the lower-left corner of the red box is not perfectly detected by any method, including ours; however, the large building on the right is extracted with the most complete and precise contours by MSGCANet, demonstrating overall superior performance compared to the other methods.
5. Discussion
5.1. Effectiveness of Contextual Exploration Module
5.2. Effectiveness of Window-Guided Multi-Scale Attention Mechanism
5.3. Analysis About the Windows of Window-Guided Multi-Scale Attention Mechanism
5.4. Limitations and Future Work
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Rathore, M.M.; Ahmad, A.; Paul, A.; Rho, S. Urban planning and building smart cities based on the Internet of Things using Big Data analytics. Comput. Netw. 2016, 101, 63–80.
- Xie, Y.; Weng, A.; Weng, Q. Population Estimation of Urban Residential Communities Using Remotely Sensed Morphologic Data. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1111–1115.
- Wang, H.; Wei, Y.; Liu, Y.; Cao, Y.; Liu, R.; Ning, X. Evaluation of Chinese Urban Land-Use Efficiency (Sdg11.3.1) Based on High-Precision Urban Built-up Area Data. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 858–862.
- Wu, F.; Wang, C.; Zhang, B.; Zhang, H.; Gong, L. Discrimination of Collapsed Buildings from Remote Sensing Imagery Using Deep Neural Networks. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 2646–2649.
- Sasmoko; Wijaksono, S.; Indrianti, Y.; Rahmayati, Y. Empirical Study on the Effect of Green Building and Risk Management on Economic Quality and Sustainability in the Indonesian Sustainable Architecture Index. In Proceedings of the 2024 International Conference on ICT for Smart Society (ICISS), Yogyakarta, Indonesia, 4–5 September 2024; pp. 1–5.
- Shackelford, A.; Davis, C.; Wang, X. Automated 2-D building footprint extraction from high-resolution satellite multispectral imagery. In Proceedings of the IGARSS 2004—2004 IEEE International Geoscience and Remote Sensing Symposium, Anchorage, AK, USA, 20–24 September 2004; Volume 3, pp. 1996–1999.
- Krishnamachari, S.; Chellappa, R. An energy minimization approach to building detection in aerial images. In Proceedings of the ICASSP ’94 IEEE International Conference on Acoustics, Speech and Signal Processing, Adelaide, SA, Australia, 19–22 April 1994; Volume 5, pp. V/13–V/16.
- Jung, C.; Schramm, R. Rectangle detection based on a windowed Hough transform. In Proceedings of the 17th Brazilian Symposium on Computer Graphics and Image Processing, Curitiba, Brazil, 20 October 2004; pp. 113–120.
- Irvin, R.; McKeown, D. Methods for exploiting the relationship between buildings and their shadows in aerial imagery. IEEE Trans. Syst. Man Cybern. 1989, 19, 1564–1575.
- Huang, X.; Zhang, L. Morphological Building/Shadow Index for Building Extraction From High-Resolution Imagery Over Urban Areas. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 161–172.
- Tang, L.; Xie, W.; Hang, J. Automatic high-rise building extraction from aerial images. In Proceedings of the Fifth World Congress on Intelligent Control and Automation (IEEE Cat. No.04EX788), Hangzhou, China, 15–19 June 2004; Volume 4, pp. 3109–3113.
- Li, W.; Liu, H.; Wang, Y.; Li, Z.; Jia, Y.; Gui, G. Deep Learning-Based Classification Methods for Remote Sensing Images in Urban Built-Up Areas. IEEE Access 2019, 7, 36274–36284.
- O’Shea, K.; Nash, R. An Introduction to Convolutional Neural Networks. arXiv 2015, arXiv:1511.08458.
- Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90.
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
- Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
- Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652–662.
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424.
- Jia, H.; Yang, W.; Wang, L.; Li, H. Uncertainty-Guided Segmentation Network for Geospatial Object Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 5824–5833.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. arXiv 2017, arXiv:1612.01105.
- Chen, Y.; Cheng, H.; Yao, S.; Hu, Z. Building extraction from high-resolution remote sensing imagery based on multi-scale feature fusion and enhancement. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, XLIII-B3-2022, 55–60.
- Liu, Y.; Zhao, Z.; Zhang, S.; Huang, L. Multiregion Scale-Aware Network for Building Extraction From High-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–10.
- Zhou, Y.; Chen, Z.; Wang, B.; Li, S.; Liu, H.; Xu, D.; Ma, C. BOMSC-Net: Boundary Optimization and Multi-Scale Context Awareness Based Building Extraction From High-Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17.
- Chen, X.; Xiao, P.; Zhang, X.; Muhtar, D.; Wang, L. A Cascaded Network with Coupled High-Low Frequency Features for Building Extraction. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 10390–10406.
- Chen, J.; Liu, B.; Yu, A.; Quan, Y.; Li, T.; Guo, W. Depth Feature Fusion Network for Building Extraction in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 16577–16591.
- Sultonov, F.; Yun, S.; Kang, J.M. DASK-Net: A Lightweight Dual-Attention Selective Kernel Network for Efficient Dense Prediction in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–16.
- Li, J.; Wei, Y.; Wei, T.; He, W. A Comprehensive Deep-Learning Framework for Fine-Grained Farmland Mapping from High-Resolution Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–15.
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3141–3149.
- Guo, H.; Shi, Q.; Du, B.; Zhang, L.; Wang, D.; Ding, H. Scene-Driven Multitask Parallel Attention Network for Building Extraction in High-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 4287–4306.
- Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency Channel Attention Networks. arXiv 2021, arXiv:2012.11879.
- Das, P.; Chand, S. AttentionBuildNet for Building Extraction from Aerial Imagery. In Proceedings of the 2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), Greater Noida, India, 19–20 February 2021; pp. 576–580.
- Wang, L.; Fang, S.; Meng, X.; Li, R. Building Extraction with Vision Transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11.
- Zhai, Y.; Li, W.; Xian, T.; Jia, X.; Zhang, H.; Tan, Z.; Zhou, J.; Zeng, J.; Philip Chen, C.L. CAS-Net: Comparison-Based Attention Siamese Network for Change Detection with an Open High-Resolution UAV Image Dataset. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17.
- Fu, W.; Xie, K.; Fang, L. Complementarity-Aware Local–Global Feature Fusion Network for Building Extraction in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5617113.
- de Oliveira Junior, L.A.; Medeiros, H.R.; Macêdo, D.; Zanchettin, C.; Oliveira, A.L.I.; Ludermir, T. SegNetRes-CRF: A Deep Convolutional Encoder-Decoder Architecture for Semantic Image Segmentation. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–6.
- Liu, Y.; Fan, B.; Wang, L.; Bai, J.; Xiang, S.; Pan, C. Semantic labeling in very high resolution images via a self-cascaded convolutional neural network. ISPRS J. Photogramm. Remote Sens. 2018, 145, 78–95.
- Liu, Y.; Chen, D.; Ma, A.; Zhong, Y.; Fang, F.; Xu, K. Multiscale U-Shaped CNN Building Instance Extraction Framework with Edge Constraint for High-Spatial-Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6106–6120.
- Han, T.; Ma, J.; Wang, C.; Luo, Y.; Fan, H.; Marcato, J.; Zhang, X.; Chen, Y. CityInsight: Incorporating Dual-Condition based Diffusion Model into Building Footprint Segmentation from Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63.
- Jung, H.; Choi, H.S.; Kang, M. Boundary Enhancement Semantic Segmentation for Building Extraction from Remote Sensed Image. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12.
- Cao, S.; Feng, D.; Liu, S.; Xu, W.; Chen, H.; Xie, Y.; Zhang, H.; Pirasteh, S.; Zhu, J. BEMRF-Net: Boundary Enhancement and Multiscale Refinement Fusion for Building Extraction from Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 16342–16358.
- Zhu, X.; Zhang, X.; Zhang, T.; Tang, X.; Chen, P.; Zhou, H.; Jiao, L. Semantics and Contour Based Interactive Learning Network for Building Footprint Extraction. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13.
- Tang, S.; Wang, X.; Pan, C.; Ji, R.; Zhou, C.; Tan, K. Poly BRBLE: A Boundary Refinement-Based Individual Building Localization and Extraction Model Combined with Regularization. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–17.
- Li, X.; Liu, Z.; Luo, P.; Loy, C.C.; Tang, X. Not All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer Cascade. arXiv 2017, arXiv:1704.01344.
- Jing, L.; Chen, Y.; Tian, Y. Coarse-to-Fine Semantic Segmentation from Image-Level Labels. IEEE Trans. Image Process. 2020, 29, 225–236.
- Guo, H.; Du, B.; Zhang, L.; Su, X. A coarse-to-fine boundary refinement network for building footprint extraction from remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 183, 240–252.
- Sheikh, M.A.A.; Maity, T.; Kole, A. IRU-Net: An Efficient End-to-End Network for Automatic Building Extraction from Remote Sensing Images. IEEE Access 2022, 10, 37811–37828.
- Liu, Z.; Shi, Q.; Ou, J. LCS: A Collaborative Optimization Framework of Vector Extraction and Semantic Segmentation for Building Extraction. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15.
- Li, J.; He, W.; Li, Z.; Guo, Y.; Zhang, H. Overcoming the uncertainty challenges in detecting building changes from remote sensing images. ISPRS J. Photogramm. Remote Sens. 2025, 220, 1–17.
- Li, J.; He, W.; Cao, W.; Zhang, L.; Zhang, H. UANet: An Uncertainty-Aware Network for Building Extraction from Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5608513.
- Huo, Y.; Gang, S.; Guan, C. FCIHMRT: Feature Cross-Layer Interaction Hybrid Method Based on Res2Net and Transformer for Remote Sensing Scene Classification. Electronics 2023, 12, 4362.
- Chen, B.; Zou, X.; Zhang, Y.; Li, J.; Li, K.; Xing, J.; Tao, P. LEFormer: A Hybrid CNN-Transformer Architecture for Accurate Lake Extraction from Remote Sensing Imagery. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 5710–5714.
- Gibril, M.B.A.; Al-Ruzouq, R.; Bolcek, J.; Shanableh, A.; Jena, R. Building Extraction from Satellite Images Using Mask R-CNN and Swin Transformer. In Proceedings of the 2024 34th International Conference Radioelektronika (RADIOELEKTRONIKA), Zilina, Slovakia, 17–18 April 2024; pp. 1–5.
- Patel, S. Hybrid CNN-Transformer for Aerial Object Detection: A Novel Architecture for Enhanced Detection Accuracy. In Proceedings of the 2025 International Conference on Machine Learning and Autonomous Systems (ICMLAS), Prawet, Thailand, 10–12 March 2025; pp. 693–698.
- Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction from an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586.
- Mnih, V. Machine Learning for Aerial Image Labeling. Ph.D. Thesis, University of Toronto, Toronto, ON, Canada, 2013.
- Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can Semantic Labeling Methods Generalize to Any City? The Inria Aerial Image Labeling Benchmark. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Sung, K.K.; Poggio, T. Example-based learning for view-based human face detection. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 39–51.
- Joachims, T. Text categorization with Support Vector Machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, ECML’98, Berlin/Heidelberg, Germany, 21–23 April 1998; pp. 137–142.
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019, arXiv:1711.05101.
Table 1. Quantitative comparison on the Massachusetts Building Dataset.

Method | Year | IoU(%) | F1(%) | Pre(%) | Rec(%) |
---|---|---|---|---|---|
Deeplab v3+ | 2018 | 69.90 | 82.28 | 83.81 | 80.81 |
CBRNet | 2021 | 74.55 | 85.42 | 86.50 | 84.36 |
BuildFormer | 2022 | 75.03 | 85.73 | 86.69 | 84.79 |
BOMSC-Net | 2022 | 74.71 | 85.13 | 86.64 | 83.68 |
CLGFF-Net | 2024 | 75.33 | 85.93 | 85.03 | 86.85 |
DFF-Net | 2024 | 72.60 | 84.20 | 87.20 | 81.30 |
CICF-Net | 2024 | 75.17 | 85.83 | - | - |
MSGCANet | - | 75.47 ± 0.014 | 86.03 ± 0.012 | 87.55 ± 0.015 | 84.50 ± 0.013 |
Table 2. Quantitative comparison on the WHU Building Dataset.

Method | Year | IoU(%) | F1(%) | Pre(%) | Rec(%) |
---|---|---|---|---|---|
Deeplab v3+ | 2018 | 86.63 | 93.39 | 92.91 | 93.88 |
CBRNet | 2021 | 91.40 | 95.51 | 95.31 | 95.70 |
BuildFormer | 2022 | 90.73 | 95.14 | 95.15 | 95.14 |
BOMSC-Net | 2022 | 90.15 | 94.80 | 95.14 | 94.50 |
CLGFF-Net | 2024 | 91.30 | 95.45 | 95.01 | 95.89 |
DFF-Net | 2024 | 90.50 | 95.00 | 95.40 | 94.60 |
CICF-Net | 2024 | 91.45 | 95.53 | - | - |
MSGCANet | - | 91.53 ± 0.013 | 95.59 ± 0.012 | 95.65 ± 0.015 | 95.46 ± 0.014 |
Table 3. Quantitative comparison on the Inria Building Dataset.

Method | Year | IoU(%) | F1(%) | Pre(%) | Rec(%) |
---|---|---|---|---|---|
Deeplab v3+ | 2018 | 76.80 | 86.88 | 87.35 | 86.40 |
CBRNet | 2021 | 81.10 | 89.56 | 89.93 | 89.20 |
BuildFormer | 2022 | 81.24 | 89.71 | 90.65 | 88.78 |
BOMSC-Net | 2022 | 78.18 | 87.75 | 87.93 | 87.58 |
CLGFF-Net | 2024 | 82.48 | 90.40 | 91.86 | 88.99 |
DFF-Net | 2024 | 77.90 | 87.60 | 88.80 | 86.30 |
CICF-Net | 2024 | 81.28 | 89.67 | - | - |
MSGCANet | - | 83.10 ± 0.015 | 90.78 ± 0.012 | 91.98 ± 0.017 | 89.55 ± 0.013 |
Table 4. Ablation results (IoU(%)/F1(%)) on the WHU, Massachusetts, and Inria datasets for configurations A, B, and C.

A | B | C | WHU IoU(%) | WHU F1(%) | Mass IoU(%) | Mass F1(%) | Inria IoU(%) | Inria F1(%) |
---|---|---|---|---|---|---|---|---|
✓ | | | 88.40 ± 0.021 | 93.97 ± 0.018 | 72.51 ± 0.025 | 84.13 ± 0.017 | 78.42 ± 0.019 | 87.95 ± 0.022 |
✓ | ✓ | | 90.61 ± 0.020 | 95.08 ± 0.017 | 73.75 ± 0.019 | 85.25 ± 0.023 | 80.40 ± 0.015 | 89.06 ± 0.027 |
✓ | | ✓ | 91.18 ± 0.026 | 95.36 ± 0.022 | 74.66 ± 0.018 | 85.07 ± 0.026 | 82.19 ± 0.021 | 90.11 ± 0.025 |
✓ | ✓ | ✓ | 91.55 ± 0.016 | 95.59 ± 0.018 | 75.45 ± 0.027 | 86.03 ± 0.020 | 83.11 ± 0.015 | 90.77 ± 0.017 |
Table 5. Analysis of the window scales of the window-guided multi-scale attention mechanism on the WHU, Massachusetts, and Inria datasets.

Scales | WHU | Mass | Inria |
---|---|---|---|
2 | | | |
4 | | | |
8 | | | |
2, 4 | | | |
2, 8 | | | |
4, 8 | | | |
2, 4, 8 | | | |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Yu, M.; Li, J.; He, W. Multi-Scale Guided Context-Aware Transformer for Remote Sensing Building Extraction. Sensors 2025, 25, 5356. https://doi.org/10.3390/s25175356