PolyReg: Autoregressive Building Outline Regularization via Masked Attention Sequence Generation
Abstract
1. Introduction
- (1) Proposing a two-stage building outline extraction method that combines the feature extraction capability of SegFormer with the sequence generation capability of Transformer models.
- (2) Constructing the PolyReg model, inspired by UniLM and large language models. PolyReg is built on the Transformer encoder and performs sequence-to-sequence generation through purpose-designed self-attention mask matrices, so it outputs building outline coordinates directly in an autoregressive manner, without cumbersome post-processing steps.
- (3) Bringing the advantages of sequence generation models to the building outline extraction task. Unlike currently popular methods that regress endpoint and corner point coordinates, the proposed method is not limited to predicting fixed point sets or predefined vertices. Because it models the entire contour as a sequence, it can generate a variable number of contour points as required, and so adapts better to buildings of different shapes.
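The masking idea in contribution (2) can be illustrated with a minimal NumPy sketch of a UniLM-style sequence-to-sequence attention mask. This is not the authors' implementation: the function names, the token counts, and the split into "source" tokens (for example, an initial contour) and "target" tokens (the regularized outline being generated) are assumptions made for illustration only.

```python
import numpy as np


def seq2seq_attention_mask(src_len: int, tgt_len: int) -> np.ndarray:
    """Build a square boolean mask of size (src_len + tgt_len); True = may attend.

    Assumed layout (hypothetical, for illustration): source tokens come first,
    followed by target tokens.
    - Source rows attend to all source tokens (bidirectional) and to no target token.
    - Target rows attend to all source tokens and to target tokens up to and
      including their own position (causal).
    """
    total = src_len + tgt_len
    mask = np.zeros((total, total), dtype=bool)
    mask[:, :src_len] = True  # every position may look at the full source
    mask[src_len:, src_len:] = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))
    return mask


def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Row-wise softmax over attention scores, with masked positions set to zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


# Small demonstration: 3 source tokens, 2 target tokens.
demo_mask = seq2seq_attention_mask(3, 2)
print(demo_mask.astype(int))
```

With a mask of this shape, a single encoder stack can be run autoregressively: each newly generated target token is appended to the sequence and sees only the source and the tokens generated before it, which is what allows the output length to vary from one building to the next.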
2. Materials and Methods
2.1. Model Architecture
2.2. SegFormer
2.3. PolyReg
2.3.1. PolyReg Network Architecture
2.3.2. Mask-Attention
2.3.3. Pre-Training Task Design
3. Experiments
3.1. Dataset
3.2. Model Parameter Settings
3.3. Evaluation Metrics
3.4. Experiments and Analysis
4. Discussion
4.1. Comparative Analysis
4.2. Impact of Different Backbones on PolyReg
4.3. Limitations
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest

Dataset | City | Total Area | Total Buildings |
---|---|---|---|
Train | Austin, TX, USA | 52,275 | |
Train | Kitsap County, WA, USA | 24,066 | |
Train | Vienna, Austria | 32,229 | |
Test | Chicago, IL, USA | 80,652 | |
Test | West Tyrol, Austria | 17,458 | |

City | Method | IoU | HD | ACR | PCR | SRP | N-Ratio | C-IoU |
---|---|---|---|---|---|---|---|---|
Chicago | SegFormer | 0.67 | 1.20 | 0.79 | 1.92 | 0.88 | 2.98 | 0.57 |
Chicago | SegFormer + PolyReg | 0.80 | 0.60 | 0.88 | 1.03 | 0.63 | 0.63 | 0.71 |
West Tyrol | SegFormer | 0.60 | 1.78 | 0.69 | 1.91 | 0.90 | 2.57 | 0.43 |
West Tyrol | SegFormer + PolyReg | 0.79 | 0.58 | 0.80 | 0.83 | 0.55 | 0.51 | 0.57 |

City | Method | F1-Score | IoU | HD | ACR | PCR | SRP | C-IoU |
---|---|---|---|---|---|---|---|---|
Chicago | CRM | 0.75 | 0.58 | 0.72 | 0.67 | 1.02 | 0.05 | 0.54 |
Chicago | MLR | 0.34 | 0.25 | 2.89 | 0.58 | 3.34 | 0.69 | 0.13 |
Chicago | HiSup | 0.70 | 0.58 | 1.83 | 0.65 | 1.53 | 0.45 | 0.48 |
Chicago | SegFormer + PolyReg | 0.89 | 0.80 | 0.59 | 0.88 | 1.03 | 0.63 | 0.71 |
West Tyrol | CRM | 0.79 | 0.68 | 1.45 | 0.76 | 0.62 | 0.12 | 0.49 |
West Tyrol | MLR | 0.58 | 0.45 | 3.27 | 0.53 | 1.76 | 0.67 | 0.27 |
West Tyrol | HiSup | 0.83 | 0.75 | 0.57 | 0.83 | 0.83 | 0.52 | 0.54 |
West Tyrol | SegFormer + PolyReg | 0.85 | 0.79 | 0.58 | 0.80 | 0.83 | 0.55 | 0.57 |

Method | Average Inference Time (seconds per image tile) |
---|---|
CRM | 0.44 |
MLR | 0.89 |
HiSup | 0.70 |
SegFormer + PolyReg | 0.98 |

City | Method | P | R | IoU | HD | ACR | PCR | SRP | N-Ratio | C-IoU |
---|---|---|---|---|---|---|---|---|---|---|
Chicago | U-Net | 0.773 | 0.831 | 0.568 | 1.641 | 0.701 | 1.574 | 0.755 | 3.244 | 0.512 |
Chicago | U-Net + PolyReg | 0.922 | 0.856 | 0.799 | 0.632 | 0.873 | 0.995 | 0.588 | 0.697 | 0.638 |
Chicago | U-Net++ | 0.812 | 0.808 | 0.639 | 1.399 | 0.731 | 1.691 | 0.721 | 3.098 | 0.537 |
Chicago | U-Net++ + PolyReg | 0.921 | 0.849 | 0.791 | 0.755 | 0.871 | 1.009 | 0.463 | 0.579 | 0.676 |
Chicago | DeepLabv3+ | 0.815 | 0.838 | 0.690 | 1.029 | 0.797 | 1.878 | 0.782 | 2.976 | 0.594 |
Chicago | DeepLabv3+ + PolyReg | 0.928 | 0.839 | 0.788 | 0.777 | 0.863 | 0.996 | 0.435 | 0.791 | 0.694 |
Chicago | SegFormer | 0.877 | 0.781 | 0.673 | 1.197 | 0.785 | 1.917 | 0.883 | 2.978 | 0.571 |
Chicago | SegFormer + PolyReg | 0.920 | 0.862 | 0.797 | 0.594 | 0.882 | 1.029 | 0.628 | 0.625 | 0.706 |
West Tyrol | U-Net | 0.888 | 0.656 | 0.553 | 1.927 | 0.631 | 1.958 | 0.937 | 3.135 | 0.411 |
West Tyrol | U-Net + PolyReg | 0.941 | 0.808 | 0.764 | 0.983 | 0.805 | 0.990 | 0.496 | 0.434 | 0.543 |
West Tyrol | U-Net++ | 0.770 | 0.821 | 0.613 | 2.224 | 0.711 | 1.9235 | 0.825 | 4.111 | 0.299 |
West Tyrol | U-Net++ + PolyReg | 0.947 | 0.791 | 0.755 | 1.075 | 0.798 | 0.989 | 0.562 | 0.434 | 0.543 |
West Tyrol | DeepLabv3+ | 0.810 | 0.612 | 0.587 | 1.878 | 0.671 | 2.192 | 0.750 | 3.058 | 0.414 |
West Tyrol | DeepLabv3+ + PolyReg | 0.942 | 0.800 | 0.758 | 1.047 | 0.795 | 0.986 | 0.463 | 0.377 | 0.539 |
West Tyrol | SegFormer | 0.851 | 0.677 | 0.599 | 1.784 | 0.685 | 1.905 | 0.900 | 2.565 | 0.431 |
West Tyrol | SegFormer + PolyReg | 0.875 | 0.826 | 0.793 | 0.582 | 0.801 | 0.827 | 0.545 | 0.506 | 0.574 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Cui, L.; Li, C.; Chen, X.; Wang, X.; Qian, H. PolyReg: Autoregressive Building Outline Regularization via Masked Attention Sequence Generation. Remote Sens. 2025, 17, 1650. https://doi.org/10.3390/rs17091650