# Building Polygon Extraction from Aerial Images and Digital Surface Models with a Frame Field Learning Framework


## Abstract


## 1. Introduction

- We introduce the nDSM and near-infrared image into the deep learning model, fusing image and 3D information to improve information extraction in the building segmentation process.
- We evaluate the performance of the considered methods, adopting different metrics to assess the results at the pixel, object, and polygon levels. We analyze the deviations in the number of vertices per building extracted by the proposed methods compared with the reference polygons.
- We have constructed a new building dataset for the city of Enschede, The Netherlands, to promote further research in this field. The data will be published with the following DOI: 10.17026/dans-248-n3nd.

## 2. Proposed Method

- First, an initial contour is produced from the segmentation;
- Then, the contour is iteratively adjusted with the constraints of the frame field;
- With the direction information of the frame field, the corners are identified from other vertices and further preserved in the simplification.
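The corner-preserving simplification step can be illustrated with a toy sketch. The actual method identifies corners from the frame field directions; as a simplified stand-in, the sketch below treats a vertex as a corner when the turning angle of the polygon boundary at that vertex exceeds a threshold (the `corner_deg` parameter and the helper functions are illustrative, not part of the original method):

```python
import math

def turn_angle_deg(p_prev, p, p_next):
    """Absolute turning angle of the boundary at vertex p, in degrees."""
    v1 = (p[0] - p_prev[0], p[1] - p_prev[1])
    v2 = (p_next[0] - p[0], p_next[1] - p[1])
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return abs(math.degrees(math.atan2(cross, dot)))

def simplify_keep_corners(polygon, corner_deg=30.0):
    """Drop near-collinear vertices of a closed polygon; keep corners."""
    n = len(polygon)
    return [polygon[i] for i in range(n)
            if turn_angle_deg(polygon[i - 1], polygon[i],
                              polygon[(i + 1) % n]) >= corner_deg]

# A square with a redundant vertex on its bottom edge:
poly = [(0, 0), (1, 0), (2, 0), (2, 2), (0, 2)]
print(simplify_keep_corners(poly))  # the collinear vertex (1, 0) is removed
```

Using the frame field instead of the turning angle makes this classification robust to noisy contours, since corners are read from the learned directions rather than from the raw geometry.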

#### 2.1. Frame Field Learning

#### 2.2. Polygonization

#### 2.3. Loss Function

## 3. Dataset

#### 3.1. Dataset

The dataset of the study area was acquired in 2019. The DTM and DSM were derived from point cloud data using the squared IDW method at 0.5 m resolution. The LiDAR point clouds and DSM are shown in Figure 4. Building footprint polygons were obtained from the publicly available geodata BAG [25] and used as training and reference data in our experimental analysis. The BAG is part of the government system of key registers captured by a municipality and subcontractors. Some footprints are not aligned with the ground truth, mostly because of human activities, e.g., buildings that are planned but not yet constructed, or buildings that were scheduled for demolition and have already been demolished. We edited these manually. Buildings with shared walls are difficult for the network to distinguish, so the “dissolve” operation in QGIS was applied to BAG’s original polygons to merge adjacent footprints into one; the results are shown in Figure 5. Composite image 1 (RGB + nDSM) was produced by stacking the nDSM onto the original aerial image as a 4th band. Composite image 2 (RGB + NIR + nDSM) was produced by stacking the NIR as the 4th band and the nDSM as the 5th band onto the original aerial image.
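The nDSM derivation and band stacking can be sketched with NumPy. This is a minimal sketch assuming the rasters are already co-registered arrays of the same shape (the array names and toy sizes are illustrative):

```python
import numpy as np

# Co-registered input rasters (toy data; the real tiles are 1024 x 1024).
h, w = 4, 4
rgb = np.random.rand(h, w, 3)   # aerial image, 3 bands
nir = np.random.rand(h, w)      # near-infrared band
dsm = np.random.rand(h, w) + 1  # digital surface model
dtm = np.random.rand(h, w)      # digital terrain model

# The normalized DSM is the height above the terrain.
ndsm = dsm - dtm

# Composite image 1: nDSM stacked as the 4th band.
composite1 = np.dstack([rgb, ndsm])
# Composite image 2: NIR as the 4th band, nDSM as the 5th band.
composite2 = np.dstack([rgb, nir, ndsm])

print(composite1.shape, composite2.shape)  # (4, 4, 4) (4, 4, 5)
```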

#### 3.2. Evaluation Metrics

**Pixel-level metrics.** For evaluating the results, we used the intersection over union (IoU), computed by dividing the intersection area by the union area of the predicted segmentation (p) and the ground truth (g) at the pixel level.
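As a concrete sketch, pixel-level IoU over binary masks can be computed as follows (a minimal NumPy version, not the authors' implementation):

```python
import numpy as np

def pixel_iou(pred, gt):
    """IoU of two binary masks: |p ∩ g| / |p ∪ g|."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:          # both masks empty: define IoU as 1
        return 1.0
    return np.logical_and(pred, gt).sum() / union

p = np.zeros((4, 4), dtype=bool); p[:2, :] = True   # 8 pixels
g = np.zeros((4, 4), dtype=bool); g[1:3, :] = True  # 8 pixels
print(pixel_iou(p, g))  # intersection 4, union 12 -> 0.3333...
```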

**Object-level metrics.** Average precision (AP) and average recall (AR), as defined by the MS COCO measures, are introduced to evaluate our results. AP and AR are calculated over multiple IoU values, where IoU is the intersection of the predicted polygon with the ground truth polygon divided by the union of the two polygons. Ten IoU thresholds are used, ranging from 0.50 to 0.95 in steps of 0.05. For each threshold, only the predicted results with an IoU above the threshold are counted as true positives (tp); the rest are denoted as false positives (fp). Ground truth polygons with an IoU smaller than the threshold are false negatives (fn) [9]. We then use Equations (16) and (17) to calculate the corresponding precision and recall. AP and AR are the average values of all precisions and recalls, respectively, calculated over the 10 IoU categories, and can be denoted as mAP and mAR. AP and AR are also calculated based on the size of the objects: small (area < 32²), medium (32² < area < 96²), and large (area > 96²), where the area is measured as the number of pixels in the segmentation mask. These are denoted AP_{S}, AP_{M}, and AP_{L} for the precision and AR_{S}, AR_{M}, and AR_{L} for the recall. We followed the same metric standards but applied them to building polygons directly; specifically, the IoU calculation was performed on polygons. For the alternative method, PolyMapper, the evaluation is based on segmentation in raster format, as its data are in COCO format.
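The per-threshold counting described above can be sketched as follows. The sketch assumes each prediction has already been matched one-to-one to its best ground truth polygon, and takes the resulting IoU per prediction as input; the function and variable names are illustrative:

```python
def precision_recall(pred_ious, n_gt, thr):
    """tp/fp/fn counting at one IoU threshold.

    pred_ious: best-match IoU for each predicted polygon,
    assuming one-to-one matching to distinct ground truths.
    """
    tp = sum(1 for iou in pred_ious if iou >= thr)
    fp = len(pred_ious) - tp
    fn = n_gt - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Average over the 10 COCO thresholds 0.50, 0.55, ..., 0.95:
thresholds = [0.50 + 0.05 * i for i in range(10)]
pred_ious, n_gt = [0.92, 0.61, 0.40], 3
ap = sum(precision_recall(pred_ious, n_gt, t)[0] for t in thresholds) / len(thresholds)
ar = sum(precision_recall(pred_ious, n_gt, t)[1] for t in thresholds) / len(thresholds)
print(round(ap, 3), round(ar, 3))  # 0.4 0.4
```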

**Polygon-level metrics.** Besides the COCO metrics, the polygons and line segments (PoLiS) metric was introduced to evaluate the similarity of the predicted polygons to the corresponding reference polygons. It accounts for positional and shape differences by considering polygons as sequences of connected edges instead of only point sets [26]. We used this metric to evaluate the quality of the predicted polygons. We first filtered the polygons with IoU ≥ 0.5 to find the predicted polygons and their corresponding reference polygons. The metric is expressed as follows:
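For reference, the PoLiS distance of Avbelj et al. [26] symmetrically averages the distance from each vertex of one polygon to the boundary of the other. A minimal pure-Python sketch (not the authors' implementation; polygons are lists of (x, y) vertices):

```python
def _point_segment_dist(p, a, b):
    """Euclidean distance from point p to segment ab."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    t = 0.0 if seg_len2 == 0 else max(
        0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    cx, cy = ax + t * dx, ay + t * dy
    return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5

def _point_boundary_dist(p, poly):
    """Distance from p to the closed boundary of poly."""
    n = len(poly)
    return min(_point_segment_dist(p, poly[i], poly[(i + 1) % n])
               for i in range(n))

def polis(A, B):
    """PoLiS(A, B): vertex-to-boundary distances, averaged symmetrically."""
    term_a = sum(_point_boundary_dist(a, B) for a in A) / (2 * len(A))
    term_b = sum(_point_boundary_dist(b, A) for b in B) / (2 * len(B))
    return term_a + term_b

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
shifted = [(x + 0.5, y) for x, y in square]
print(polis(square, square))   # 0.0
print(polis(square, shifted))  # 0.25
```

Because every vertex contributes, PoLiS penalizes both positional shifts and spurious extra vertices along the boundary, which is why it complements the purely area-based IoU.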

#### 3.3. Implementation Details

## 4. Results

#### 4.1. Quantitative Analysis

#### 4.2. Qualitative Analysis

#### 4.3. Analysis of the Number of Vertices

#### 4.4. Comparison with Alternative Methods

#### 4.5. Limitations and Insights

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Data Availability Statement

## Conflicts of Interest

## References

- Nahhas, F.H.; Shafri, H.Z.M.; Sameen, M.I.; Pradhan, B.; Mansor, S. Deep Learning Approach for Building Detection Using LiDAR-Orthophoto Fusion. J. Sens. **2018**, 2018, 7212307.
- Sohn, G.; Dowman, I. Data Fusion of High-Resolution Satellite Imagery and LiDAR Data for Automatic Building Extraction. ISPRS J. Photogramm. Remote Sens. **2007**, 62, 43–63.
- Zhao, W.; Persello, C.; Stein, A. Building Outline Delineation: From Aerial Images to Polygons with an Improved End-to-End Learning Framework. ISPRS J. Photogramm. Remote Sens. **2021**, 175, 119–131.
- Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
- Shao, Z.; Tang, P.; Wang, Z.; Saleem, N.; Yam, S.; Sommai, C. BRRNet: A Fully Convolutional Neural Network for Automatic Building Extraction from High-Resolution Remote Sensing Images. Remote Sens. **2020**, 12, 1050.
- Wei, S.; Ji, S.; Lu, M. Toward Automatic Building Footprint Delineation from Aerial Images Using CNN and Regularization. IEEE Trans. Geosci. Remote Sens. **2020**, 58, 2178–2189.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Zhang, L.; Wu, J.; Fan, Y.; Gao, H.; Shao, Y. An Efficient Building Extraction Method from High Spatial Resolution Remote Sensing Images Based on Improved Mask R-CNN. Sensors **2020**, 20, 1465.
- Girard, N.; Smirnov, D.; Solomon, J.; Tarabalka, Y. Polygonal Building Extraction by Frame Field Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 5891–5900.
- Li, Z.; Wegner, J.D.; Lucchi, A. Topological Map Extraction from Overhead Images. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1715–1724.
- Vaxman, A.; Campen, M.; Diamanti, O.; Panozzo, D.; Bommes, D.; Hildebrandt, K.; Ben-Chen, M. Directional Field Synthesis, Design, and Processing. Comput. Graph. Forum **2016**, 35, 545–572.
- Huang, J.; Zhang, X.; Xin, Q.; Sun, Y.; Zhang, P. Automatic Building Extraction from High-Resolution Aerial Images and LiDAR Data Using Gated Residual Refinement Network. ISPRS J. Photogramm. Remote Sens. **2019**, 151, 91–105.
- Hong, D.; Gao, L.; Yokoya, N.; Yao, J.; Chanussot, J.; Du, Q.; Zhang, B. More Diverse Means Better: Multimodal Deep Learning Meets Remote Sensing Imagery Classification. IEEE Trans. Geosci. Remote Sens. **2020**.
- Al-Najjar, H.A.H.; Kalantar, B.; Pradhan, B.; Saeidi, V.; Halin, A.A.; Ueda, N.; Mansor, S. Land Cover Classification from Fused DSM and UAV Images Using Convolutional Neural Networks. Remote Sens. **2019**, 11, 1461.
- Bittner, K.; Adam, F.; Cui, S.; Körner, M.; Reinartz, P. Building Footprint Extraction From VHR Remote Sensing Images Combined With Normalized DSMs Using Fused Fully Convolutional Networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. **2018**, 11, 2615–2629.
- Schuegraf, P.; Bittner, K. Automatic Building Footprint Extraction from Multi-Resolution Remote Sensing Images Using a Hybrid FCN. ISPRS Int. J. Geo-Inf. **2019**, 8, 191.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2015; Volume 9351, pp. 234–241.
- Diamanti, O.; Vaxman, A.; Panozzo, D.; Sorkine-Hornung, O. Designing N-Polyvector Fields with Complex Polynomials. Eurograph. Symp. Geom. Process. **2014**, 33, 1–11.
- Lorensen, W.E.; Cline, H.E. Marching Cubes: A High Resolution 3D Surface Construction Algorithm. Comput. Graph. **1987**, 21, 163–169.
- Kass, M.; Witkin, A. Snakes: Active Contour Models; Kluwer Academic Publishers: Berlin/Heidelberg, Germany, 1988.
- Hashemi, S.R.; Sadegh, S.; Salehi, M.; Erdogmus, D.; Prabhu, S.P.; Warfield, S.K.; Gholipour, A. Tversky as a Loss Function for Highly Unbalanced Image Segmentation Using 3D Fully Convolutional Deep Networks; Springer International Publishing: New York, NY, USA, 2018.
- Kadaster (The Netherlands’ Cadastre, Land Registry and Mapping Agency). Available online: https://www.devex.com/organizations/the-netherlands-cadastre-land-registry-and-mapping-agency-kadaster-29602 (accessed on 19 May 2019).
- PDOK (the Public Services On the Map). Available online: https://www.pdok.nl/ (accessed on 19 May 2019).
- AHN (Het Actueel Hoogtebestand Nederland). Available online: https://www.ahn.nl/ (accessed on 19 May 2019).
- BAG (Basisregistratie Adressen en Gebouwen). Available online: https://www.geobasisregistraties.nl/basisregistraties/adressen-en-gebouwen (accessed on 19 May 2019).
- Avbelj, J.; Muller, R.; Bamler, R. A Metric for Polygon Comparison and Building Extraction Evaluation. IEEE Geosci. Remote Sens. Lett. **2015**, 12, 170–174.
- Ghamisi, P.; Yokoya, N. IMG2DSM: Height Simulation from Single Imagery Using Conditional Generative Adversarial Net. IEEE Geosci. Remote Sens. Lett. **2018**, 15, 794–798.
- Paoletti, M.E.; Haut, J.M.; Ghamisi, P.; Yokoya, N.; Plaza, J.; Plaza, A. U-IMG2DSM: Unpaired Simulation of Digital Surface Models with Generative Adversarial Networks. IEEE Geosci. Remote Sens. Lett. **2021**, 18, 1288–1292.

**Figure 1.** The red polygon represents the building outline with a slanted wall. At least one direction of the frame field is aligned with the tangent of the contour where it lies along the edge.

**Figure 2.**The workflow of the investigated frame field method for building delineation fusing nDSM and RGB data. Adapted from [1].

**Figure 4.** (**a**) Sample data of LiDAR point clouds; (**b**) the derived DSM with 0.5 m of spatial resolution.

**Figure 6.** (**a**) The urban area is denoted by the red polygons; (**b**) the tile distribution for the urban area.

**Figure 7.**PoLiS distance p between extracted building A (green) and reference building B (brown) marked with solid black lines (modified from [3]).

**Figure 8.** Results obtained on two tiles of the test dataset for the urban area. The loss functions are cross-entropy and Dice. The background is the aerial image and the corresponding nDSM. The predicted polygons are produced with 1 pixel for the tolerance parameter of the polygonization method. From left to right: (**a**) reference building footprints; (**b**) predicted polygons on aerial images (RGB); (**c**) predicted polygons on nDSM; (**d**) predicted polygons on composite image 1 (RGB + nDSM); (**e**) predicted polygons on composite image 2 (RGB + NIR + nDSM).

**Figure 9.** Results obtained on the urban area dataset. The predicted polygons are produced with 1 pixel for the tolerance parameter of the polygonization method. From left to right: (**a**) reference building footprints; (**b**) predicted polygon on aerial images (RGB); (**c**) predicted polygon on nDSM; (**d**) predicted polygon on composite image 1 (RGB + nDSM); (**e**) predicted polygon on composite image 2 (RGB + NIR + nDSM).

**Figure 10.** Results obtained on the urban area test dataset (RGB + NIR + nDSM). The predicted polygons are produced with 1 pixel for the tolerance parameter of the polygonization method. (**a**) Reference building footprints; (**b**) predicted polygons with cross-entropy and Dice as loss functions; (**c**) predicted polygons with Tversky as the loss function.

**Figure 11.** Results obtained on the urban area test dataset (RGB + nDSM) with high mean IoU. The predicted polygons are produced with 1 pixel for the tolerance parameter of the polygonization method. (**a**) Predicted polygons with mean IoU 1; (**b**) predicted polygons with mean IoU 0.955; (**c**) predicted polygons with mean IoU 0.951; (**d**) predicted polygons with mean IoU 0.937.

**Figure 12.** Results obtained on the urban area test dataset (RGB + nDSM) with low mean IoU. The predicted polygons are produced with 1 pixel for the tolerance parameter of the polygonization method. (**a**) Predicted polygons with mean IoU 0; (**b**) predicted polygons with mean IoU 0.195; (**c**) predicted polygons with mean IoU 0.257; (**d**) predicted polygons with mean IoU 0.345.

**Figure 13.** Results obtained on the urban area test dataset (RGB + nDSM) with low mean IoU. The predicted polygons are produced with 1 pixel for the tolerance parameter of the polygonization method. In the nDSM, lighter pixels indicate greater height. (**a**) Predicted polygons with mean IoU 0; (**b**) predicted polygons with mean IoU 0.345.

**Figure 14.** Example polygon obtained with different tolerance values using the composite image 1 (RGB + nDSM). (**a**) Reference polygon; (**b**) predicted polygon with tolerance of 1 pixel; (**c**) predicted polygon with tolerance of 3 pixels; (**d**) predicted polygon with tolerance of 5 pixels; (**e**) predicted polygon with tolerance of 7 pixels; (**f**) predicted polygon with tolerance of 9 pixels.

**Figure 15.** Results obtained using aerial images (RGB). (**a**) Reference building footprints; (**b**) predicted polygons with 1-pixel tolerance parameter of the polygonization method by the frame field learning method; (**c**) predicted polygons by PolyMapper.

**Table 1.**The training set, validation set, and test set for the urban area using BAG reference polygons. The size of each tile is 1024 × 1024 pixels.

| Dataset | Number of Tiles | Number of Buildings | Ratio |
|---|---|---|---|
| Training | 579 | 29,194 | 0.7 |
| Validation | 82 | 4253 | 0.1 |
| Test | 165 | 8531 | 0.2 |

**Table 2.**Results for the urban area dataset. The mean IoU is calculated on the pixel level. Other metrics are calculated on the polygons with 1-pixel tolerance for polygonization.

| Bands | Loss Function | Mean IoU | mAP | mAR | F1 | AP_{S} | AR_{S} | AP_{M} | AR_{M} | AP_{L} | AR_{L} |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RGB, NIR, nDSM | BCE + Dice | 0.805 | 0.425 | 0.499 | 0.447 | 0.262 | 0.200 | 0.591 | 0.609 | 0.543 | 0.478 |
| | Tversky | 0.814 | 0.430 | 0.413 | 0.412 | 0.218 | 0.244 | 0.457 | 0.507 | 0.502 | 0.376 |
| RGB, nDSM | BCE + Dice | 0.800 | 0.410 | 0.488 | 0.433 | 0.255 | 0.198 | 0.576 | 0.593 | 0.534 | 0.465 |
| | Tversky | 0.776 | 0.371 | 0.399 | 0.373 | 0.204 | 0.197 | 0.441 | 0.482 | 0.464 | 0.650 |
| RGB | BCE + Dice | 0.568 | 0.067 | 0.253 | 0.102 | 0.139 | 0.024 | 0.285 | 0.261 | 0.248 | 0.232 |
| nDSM | BCE + Dice | 0.767 | 0.313 | 0.436 | 0.347 | 0.197 | 0.129 | 0.532 | 0.553 | 0.525 | 0.420 |

**Table 3.** PoLiS results for the urban area dataset. The PoLiS values are calculated on the polygons with 1-pixel tolerance for polygonization.

| Bands | Loss | PoLiS |
|---|---|---|
| RGB, NIR, nDSM | BCE + Dice | 0.52 |
| | Tversky | 0.62 |
| RGB, nDSM | BCE + Dice | 0.54 |
| | Tversky | 0.62 |
| RGB | BCE + Dice | 0.87 |
| nDSM | BCE + Dice | 0.62 |

**Table 4.** PoLiS and number of vertices for example polygons from the urban area dataset. The PoLiS values are calculated on the polygons with 1-pixel tolerance for polygonization. The polygons a, b, c, d, and e correspond to the polygons (a), (b), (c), (d), and (e) in Figure 9.

| Polygon | Dataset | PoLiS | Vertices |
|---|---|---|---|
| a | reference | – | 74 |
| b | RGB | 5.32 | 612 |
| c | nDSM | 0.81 | 44 |
| d | RGB + nDSM | 0.47 | 112 |
| e | RGB + NIR + nDSM | 0.39 | 111 |

**Table 5.** Polygons obtained with different tolerance values using the composite image 1 (RGB + nDSM) for the urban area dataset.

| Tolerance | PoLiS | RMSE | Average Ratio of Vertices Number | Average Difference of Vertices Number |
|---|---|---|---|---|
| 1 | 0.536 | 80.40 | 1.621 | 5.327 |
| 3 | 0.567 | 81.44 | 1.026 | −3.588 |
| 5 | 0.588 | 83.07 | 0.935 | −5.236 |
| 7 | 0.611 | 71.50 | 0.899 | −5.426 |
| 9 | 0.636 | 72.96 | 0.872 | −6.138 |

**Table 6.**Example polygon with different tolerances and numbers of vertices. The polygons a, b, c, d, e, and f correspond to the polygons (a), (b), (c), (d), (e), and (f) in Figure 14.

| Polygon | Tolerance | PoLiS | Vertices | Ratio |
|---|---|---|---|---|
| a | – | – | 74 | – |
| b | 1 | 0.472 | 112 | 1.514 |
| c | 3 | 0.537 | 52 | 0.703 |
| d | 5 | 0.628 | 35 | 0.473 |
| e | 7 | 0.697 | 29 | 0.392 |
| f | 9 | 0.763 | 26 | 0.351 |

**Table 7.**Results obtained using aerial images (RGB) for the urban area dataset. The metrics are calculated on the polygons with 1-pixel tolerance for polygonization for the frame field learning-based method.

| Method | mAP | mAR | AP_{S} | AR_{S} | AP_{M} | AR_{M} | AP_{L} | AR_{L} |
|---|---|---|---|---|---|---|---|---|
| PolyMapper | 0.009 | 0.017 | 0.001 | 0.001 | 0.004 | 0.028 | 0.014 | 0.065 |
| Frame field | 0.067 | 0.253 | 0.139 | 0.024 | 0.285 | 0.261 | 0.248 | 0.232 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Sun, X.; Zhao, W.; Maretto, R.V.; Persello, C.
Building Polygon Extraction from Aerial Images and Digital Surface Models with a Frame Field Learning Framework. *Remote Sens.* **2021**, *13*, 4700.
https://doi.org/10.3390/rs13224700
