Open Data and Deep Semantic Segmentation for Automated Extraction of Building Footprints

Abstract: Advances in machine learning and computer vision, combined with increased access to unstructured data (e.g., images and text), have created an opportunity for automated extraction of building characteristics, cost-effectively and at scale. These characteristics are relevant to a variety of urban and energy applications, yet are time consuming and costly to acquire with today's manual methods. Several recent research studies have shown that, in comparison to more traditional methods based on a feature engineering approach, an end-to-end learning approach based on deep learning algorithms significantly improves the accuracy of automatic building footprint extraction from remote sensing images. However, these studies used limited benchmark datasets that have been carefully curated and labeled. How the accuracy of these deep learning-based approaches holds when using less curated training data has not received enough attention. The aim of this work is to leverage openly available data to automatically generate a larger training dataset with more variability in terms of regions and types of cities, which can be used to build more accurate deep learning models. In contrast to most benchmark datasets, the gathered data have not been manually curated. Thus, the training dataset is not perfectly clean in terms of remote sensing images exactly matching the ground truth building footprints. A workflow that includes data pre-processing, deep learning semantic segmentation modeling, and results post-processing is introduced and applied to a dataset that includes remote sensing images from 15 cities and five counties from various regions of the USA, which include 8,607,677 buildings. The accuracy of the proposed approach was measured on an out-of-sample testing dataset corresponding to 364,000 buildings from three USA cities. The results compared favorably to those obtained from Microsoft's recently released US building footprint dataset.


Introduction
Building footprint extraction can be used in several application areas such as population density estimation [1,2], urban planning and mapping, building energy modeling and analytics [3,4], and disaster management [5][6][7]. The use of high-resolution remote sensing images (i.e., satellite and aerial) has been increasingly explored to obtain building footprint information. While the identification of building geometry from this type of imagery is time consuming and costly to perform manually, automatic feature extraction methods hold great promise. Image semantic segmentation methods, which address the problem of assigning a categorical label (class) to each pixel of an image, are among the most commonly studied approaches for automatic extraction of features from remote sensing images. Until recently, the semantic segmentation algorithms employed to process remote sensing images were based on image feature engineering, with features often hand-crafted for each situation based on the category of object considered. However, due to the high variability of building footprint appearances, environmental characteristics such as surrounding vegetation and terrain conditions, as well as the impact of the type of sensor used to collect the imagery (e.g., different resolution), methods based on hand-crafted features are difficult to design and have not produced accurate results.
The recent development of deep learning algorithms has considerably improved the ability to accurately extract information from images using machine learning (ML). These advances have triggered the interest of the remote sensing community in the adoption of this type of method to extract a variety of information from satellite and aerial imagery. More specifically, deep convolutional neural network semantic segmentation algorithms have revolutionized the computer vision field by providing an end-to-end learning approach that does not need feature engineering. Deep convolutional neural network algorithms learn different image features by building a hierarchy of data representations, which makes this type of algorithm very efficient in processing complex real-world images such as remote sensing images. Thus, deep convolutional neural networks have been applied to several tasks in extracting information from satellite and aerial images such as road extraction [8,9], farm segmentation [10], building footprint segmentation [11][12][13][14] and building damage assessment [15,16]. These existing research studies have shown that the end-to-end learning approach based on deep learning algorithms significantly improved the accuracy of automatic building footprint extraction in comparison to more traditional methods that are based on the feature engineering approach.
One of the biggest challenges of semantic segmentation is the availability of labeled training datasets. Most of the existing studies of building footprint extraction using deep convolutional neural networks employed benchmark datasets that have been manually labeled and curated [17,18]. However, these datasets are limited to a small number of U.S. cities, making it difficult to use them to develop a model that can be generalized to the entire U.S. territory, because of the high degree of variability in building shapes and US terrain topography. For instance, Maggiori et al. [13] include six US cities (i.e., Austin, TX; Chicago, IL; Kitsap county, WA; Bellingham, WA; San Francisco CA; Bloomington, IN) and there is just one city (i.e., Las Vegas, NV) in [17]. In addition, because of the costly manual labeling and curating process, the approach of manually creating training data is not a scalable solution, especially when a significantly high number of cities are considered. Openly available GIS data are another option that can be considered for automatically generating labeled training datasets.
OpenStreetMap (OSM) is currently the largest openly licensed database of geospatial data (i.e., GIS data). It is a global dataset that is a collaborative crowd-source product, supported by more than one million volunteers who have contributed to editing the database content. It includes different types of GIS information, such as road infrastructure and built environment, and it is used in many projects as an alternative to proprietary or authoritative data. While the crowd sourced nature of OSM has been a key to its success, the small number of volunteers with professional GIS experience has raised significant concerns about its accuracy. Several studies [19,20] have performed an assessment of OSM spatial accuracy and completeness of building footprints. However, because all these studies have focused their analysis on very limited geographical regions, it is not reliable enough to extrapolate a general conclusion regarding the quality of OSM in terms of building footprint. Touzani et al. [21] conducted, on a very limited sample of U.S. cities, a qualitative analysis of the OSM data accuracy and availability regarding building footprints. The authors noted that quality and availability vary significantly, and that this variability is highly correlated to whether or not the city's authorities have made the building footprints openly available.
An increasing number of U.S. cities provide public building datasets through an easy to access open data web portal. Building footprints are provided by several cities (e.g., San Francisco, Washington D.C., Boston, Los Angeles, Chicago, New York) through such portals. The information is usually accessible via an application programming interface (API) or direct download, and it is available as a geographic information system (GIS) data file. In some cases, the height and some form of building ID are also provided. Using existing GIS tools, this type of dataset is generally easy to process. However, it is important to note that, although most of the city-provided data explored by the authors is regularly updated, some was found to be as old as 5 to 10 years, and therefore outdated.
Remote Sens. 2021, 13, 2578
The aim of this work is to use recent advances in deep semantic segmentation algorithms to extract building footprints from satellite images by leveraging openly available US building footprint GIS data to automatically generate a large training dataset that, in comparison to the available benchmark datasets, has more variability in terms of regions and types of cities, and can therefore be used to build more accurate deep learning models. However, in contrast to the previously mentioned benchmark datasets, the training data generated in this study are not manually curated. Thus, they are not perfectly clean in terms of the remote sensing images exactly matching the ground truth building footprints. Another aim of this work is to explore how feasible it is to train an accurate model by using an approach that avoids the costly manual curation of the training dataset. The major contributions of this project are:

• Building on previous research, a framework for building feature extraction from satellite/aerial images is proposed. It uses a state-of-the-art deep semantic segmentation algorithm and a postprocessing step that converts predicted probability maps into GIS files that include building footprint polygons. It also utilizes openly available building footprint GIS files to automatically generate labeled training data, which is a more scalable solution than approaches based on the costly manual generation of training data.
• To the best of our knowledge, this is the first attempt to explore the usage of such a high volume of training data (i.e., 15 cities and five counties from various regions of the USA, which include 8,607,677 buildings) that have been automatically generated using openly available public records and that have not been manually curated/de-noised.
• The proposed pipeline is tested and compared with another dataset of US building footprints that has been generated using a framework that also uses deep semantic segmentation and satellite images.

Table A1 in the Appendix A), and in total include 8,607,677 building footprint polygons.

Remote Sensing Imagery
Several providers offer high resolution satellite or aerial images, either through an API or a direct download. However, one of the biggest challenges of using data collected from these imagery companies is the licensing agreement, which can tightly limit the type of analysis that can be performed on these datasets and the sharing of resulting derivative works (e.g., extracting a footprint from aerial images). Several providers allow free access to satellite and aerial imagery (e.g., Bing Maps, Google Maps), but they do not grant the user the right to download the images and to create derivative works.
Satellite/aerial orthoimages were collected from the satellite layer of Mapbox Maps [22]. Mapbox provides a high-resolution mix of satellite and aerial images that are a compilation of several commercial and open data sources. Mapbox allows researchers to freely use these images for non-commercial academic applications [23]. The Mapbox images are color-corrected and stored in raster format, which is a pixel-based data format "that efficiently represents continuous surfaces" [22]. The Mapbox images were retrieved using the Mapbox Raster Tiles API. For each collected GIS building footprint file, we created a list of tile coordinates that cover the specific region. These coordinates follow the Mapbox Raster Tiles API default format, which is the slippy maps standard [24]: each tile is identified by its zoom level, which in turn defines the resolution, and by its tile coordinates. The zoom level selected for this work was 19. This level was chosen based on an empirical analysis that tested several zoom levels in the machine learning pipeline and compared the accuracy of the results on a very limited dataset. The total number of RGB images collected at zoom level 19 that cover geographical regions for which building footprints are available is 2,432,019, each 512 by 512 pixels in size.
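As an illustration of how a list of tile coordinates covering a region can be built, the following is a minimal sketch of the standard slippy-map conversion from longitude/latitude to tile indices. The function names and the bounding-box interface are assumptions made for this example and are not taken from the AutoBFE code; the request to the Mapbox Raster Tiles API itself is not shown.

```python
import math


def lonlat_to_tile(lon_deg, lat_deg, zoom):
    """Return the (x, y) indices of the slippy-map tile containing a point."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    lat_rad = math.radians(lat_deg)
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp to the valid index range (relevant at the edges of the map).
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_covering_bbox(west, south, east, north, zoom=19):
    """List the (zoom, x, y) tiles that cover a lon/lat bounding box."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # north-west corner
    x_max, y_max = lonlat_to_tile(east, south, zoom)  # south-east corner
    return [
        (zoom, x, y)
        for x in range(x_min, x_max + 1)
        for y in range(y_min, y_max + 1)
    ]
```

In practice the bounding box would come from the extent of each footprint GIS file, and tiles that contain no part of the region of interest could be filtered out afterwards.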
It is important to note that because the temporal resolution of the Mapbox imagery is not provided, the data used to train and evaluate the proposed approach will not be perfectly clean in terms of matching the ground truth building footprint to the satellite/aerial images. Thus, our work is based on noisier data than the previously cited research in the field. An example of this noise can be seen in Figure 1, images a and b: in the upper right corner, the footprint of the second building from the top is clearly missing from the ground truth building footprint GIS files that were used to generate the "ground truth mask". Having a discrepancy/noise between the truth and the training data is a more realistic scenario when end-to-end deep learning approaches are applied at scale, because manually curating millions of training images is not a tractable solution.

While this work was limited to Mapbox images, another potential source of free vertical aerial imagery is cities' open data portals. Cities such as New York City, Los Angeles, Portland, and Washington, D.C. regularly release high-resolution aerial imagery through their open data platforms.

Microsoft Building Footprint Dataset
Recently, Microsoft released building footprint GIS data covering the entire US territory. These data were generated from the satellite/aerial images available to Bing Maps.
Similar to the methodology that is presented in this work, Microsoft's building footprints were generated using a two-step approach comprising a deep learning semantic segmentation, followed by post processing. Unfortunately, there is no published information about which geographical regions were used to train the semantic segmentation model. Some information about the modeling and the post processing approach has been given in [25]. In this work we leveraged this data as a comparative benchmark.

Methodology
In this section we describe a workflow for building footprint extraction. All the code for the components of the proposed workflow is made openly available (i.e., AutoBFE (https://github.com/LBNL-ETA/AutoBFE, accessed on 16 March 2021)). This workflow is composed of three steps:

1. A data preprocessing step, which aims to generate training data with a very limited manual effort using openly available data sources:
   • Automatic generation of training feature masks by querying city/county footprint GIS open data
2. Deep learning modeling, using a state-of-the-art fully convolutional neural network architecture model, DeepLabv3+
3. A postprocessing step applied to the model results, which aims to generate results that are easily transformable into GIS data formats:
   • Prediction cleaning
   • Prediction transformation (i.e., converting predictions that are pixel-based masks into polygons with geographic coordinates)

Data Pre-Processing
As previously stated, the downloaded Mapbox tiles are RGB files 512 by 512 pixels in size, and no additional processing was applied to these images. For each image, the corresponding feature mask image was produced by reformatting the gathered building footprint GIS files into binary images. The feature mask is the binary representation of the building footprints that are contained in the corresponding image, where pixels corresponding to a building have a value of 1, and non-building pixels have a value of 0. In dense built-up areas, such as downtown areas, the feature masks of adjacent buildings were not distinguishable. To overcome this challenge, the surface area of each building footprint polygon was decreased by ~8%. Figure 1 shows an example of a satellite image (image a) with the corresponding mask before downsizing the polygons (image b), and after downsizing the polygons (image c).
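The text does not state how the ~8% reduction in surface area was implemented. As one possible sketch, assuming a uniform scaling of each polygon about its centroid (a negative buffer would be another option), the linear scale factor is the square root of the area factor. The helper names below are hypothetical.

```python
import math


def polygon_area(points):
    """Unsigned area of a simple polygon (shoelace formula)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_centroid(points):
    """Area-weighted centroid of a simple, non-degenerate polygon."""
    signed_area = cx = cy = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        cross = x1 * y2 - x2 * y1
        signed_area += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    signed_area *= 0.5
    return cx / (6.0 * signed_area), cy / (6.0 * signed_area)


def scale_polygon_area(points, area_factor):
    """Scale a polygon about its centroid so its area is multiplied by area_factor."""
    k = math.sqrt(area_factor)  # areas scale with the square of lengths
    cx, cy = polygon_centroid(points)
    return [(cx + k * (x - cx), cy + k * (y - cy)) for x, y in points]


# Example: shrink a 10 x 10 footprint by 8% in area before rasterizing it.
square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
shrunk = scale_polygon_area(square, 1.0 - 0.08)
```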
The Newark and Houston datasets were put aside as an independent sample to test the accuracy of the proposed methodology. For the remaining data, the pairs of feature masks and satellite/aerial images of each city/county dataset were randomly divided into three samples to be used for training, validation and testing. The training dataset has ~60% of the total number of pairs (n = 1.5 M pairs), the validation dataset has ~20% (n = 0.45 M pairs) and the test dataset also has ~20% (n = 0.45 M pairs). The validation dataset is used to select the best model during the training process, and the test sample is used as an independent sample to estimate the semantic segmentation accuracy of the trained model.
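A random 60/20/20 split of the image/mask pairs of one city or county can be sketched as follows. The function name, the use of tile identifiers as pair keys, and the fixed seed are assumptions made for this illustration.

```python
import random


def split_pairs(pair_ids, train_frac=0.6, val_frac=0.2, seed=0):
    """Randomly split pair identifiers into training, validation and test lists."""
    ids = list(pair_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_train = round(len(ids) * train_frac)
    n_val = round(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # whatever remains (~20%)
    return train, val, test
```

Applying the split separately to each city/county dataset, as described above, keeps every region represented in all three samples.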

Model
Semantic segmentation is a computer vision family of algorithms that aims to assign a class to every pixel of an image. Recent advances in deep neural network architectures, and especially the fully convolutional networks (FCNs), have been shown to be very effective for semantic segmentation tasks on several benchmark datasets [26]. In this study we employed the DeepLabv3+ model, as it has demonstrated more precise semantic segmentations than Unet [27] and SegNet [28] on various semantic tasks including segmentation of building footprints from satellite/aerial imagery. The DeepLabv3+ model uses an encoder-decoder network architecture with an atrous spatial pyramid pooling (ASPP) module to extract multi-scale contextual information by pooling features at various resolutions. These two features address some of the limitations of FCN algorithms, such as the challenges in segmenting small and complex objects, reduced feature resolution due to the consecutive pooling operations that cause the loss of detailed spatial features, and the existence of objects at multiple scales.
In the encoder-decoder structure, the encoder module extracts abstract features from the input images by gradually reducing the feature maps. The decoder module is responsible for recovering spatial resolution and location information by gradually up-sampling the feature maps. In this work, the output stride was set to 16, which was shown [29] to be the best trade-off between computational speed and accuracy. The output stride is the ratio of the input image spatial resolution to the resolution of the final output of the encoder. In the decoder module, the encoder features are first bilinearly up-sampled by a factor of 4 and then concatenated with the corresponding low-level feature maps of the same resolution from the encoder module. After the concatenation, a few 3 by 3 convolutions are applied to refine the features, followed by a bilinear up-sampling by a factor of 4. In this work, a ResNet-101 architecture [31] pretrained on the ImageNet dataset [30] was used as the DeepLabv3+ network backbone. For a more detailed description of the DeepLabv3+ architecture we refer the reader to the original paper [29].
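To make the resolution bookkeeping concrete, the short sketch below traces the spatial size of the feature maps for a 512 by 512 input with an output stride of 16 and two bilinear up-sampling steps of factor 4. It is only arithmetic on the numbers given above (the function name is hypothetical), not a model implementation.

```python
def deeplab_feature_sizes(input_size=512, output_stride=16, first_up=4, second_up=4):
    """Return the side length of the feature maps at three points of the network."""
    encoder_out = input_size // output_stride  # final encoder output
    after_first_up = encoder_out * first_up    # size at which low-level features are concatenated
    after_second_up = after_first_up * second_up  # final prediction size
    return encoder_out, after_first_up, after_second_up


sizes = deeplab_feature_sizes()  # (32, 128, 512)
```

With these settings the encoder output is 32 by 32, the concatenation with low-level features happens at 128 by 128, and the final up-sampling restores the 512 by 512 input resolution.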

Loss Function
The loss function has an essential impact on the model accuracy, and the most suitable loss function usually depends on the data properties and the class definitions. In this work, a combination of two loss functions was used to train the DeepLabv3+ network; this choice was motivated by the results of an empirical analysis that tested several loss functions and compared the accuracy on a very limited dataset randomly selected from the total training sample. The combined loss includes a weighted cross-entropy loss function [27] and the Dice loss function [32]. The cross-entropy loss measures the cumulative pixel-wise error probability between the predicted output class and the target class. To force the network to learn the separation borders between touching buildings, spatial weight maps are used for weighting the cross-entropy loss function at each pixel. These weight maps are pre-computed for each training ground truth mask using morphological operations:

$$w(x) = w_0 \cdot \exp\left(-\frac{\left(d_1(x) + d_2(x)\right)^2}{2\sigma^2}\right) \tag{1}$$

where $d_1$ is the distance to the border of the nearest building and $d_2$ is the distance to the border of the second nearest building. As in [27], we set $w_0 = 10$ and $\sigma = 5$. An example of spatial weight maps is depicted in image d of Figure 1. The spatially weighted cross-entropy loss is defined as:

$$L_{CE} = -\sum_{x_i \in X} w(x_i) \log\left(p(x_i)\right) \tag{2}$$

where $X$ is the training sample, $w(x_i)$ is the computed spatial weight at pixel $x_i$, and $p(x_i)$ is the pixel-wise soft-max over the last DeepLabv3+ layer. The Dice loss function for multiclass segmentation, also known as generalized Dice loss [32], is defined as:

$$L_{Dice} = 1 - \frac{2 \sum_{c=1}^{C} \sum_{x_i \in X} r_c(x_i)\, p_c(x_i)}{\sum_{c=1}^{C} \sum_{x_i \in X} \left(r_c(x_i) + p_c(x_i)\right) + \varepsilon} \tag{3}$$

where $\varepsilon$ is a small value added for numerical stability (set to $10^{-6}$), $C$ is the number of classes, $r_c$ is equal to 1 if the pixel corresponds to class $c$ and equal to 0 otherwise, and $p_c$ is the soft-max prediction for class $c$. Therefore, the loss function used for training models in this work is defined as:

$$L = w_{CE} L_{CE} + w_{Dice} L_{Dice} \tag{4}$$

where $w_{CE}$ and $w_{Dice}$ are the weights of each component of the loss function. In this work $w_{CE}$ and $w_{Dice}$ are both set to 0.5.
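The combined loss can be illustrated on plain Python lists. This is a hedged sketch rather than the training code: it assumes the spatial weight is the exponential border term of [27] without a class-balancing term, that the cross-entropy is summed over pixels, and that the Dice loss is the unweighted multiclass form with ε in the denominator. All function names are hypothetical, and the distances d1 and d2 are taken as given rather than computed from a mask.

```python
import math


def border_weight(d1, d2, w0=10.0, sigma=5.0):
    """Spatial weight of a pixel from its distances to the two nearest buildings."""
    return w0 * math.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))


def weighted_cross_entropy(p_true, weights):
    """p_true[i] is the soft-max probability assigned to the true class of pixel i."""
    return -sum(w * math.log(p) for w, p in zip(weights, p_true))


def dice_loss(targets, probs, eps=1e-6):
    """targets[c][i] is 1 if pixel i belongs to class c; probs[c][i] is the soft-max output."""
    intersection = sum(r * p for rc, pc in zip(targets, probs) for r, p in zip(rc, pc))
    total = sum(r + p for rc, pc in zip(targets, probs) for r, p in zip(rc, pc))
    return 1.0 - 2.0 * intersection / (total + eps)


def combined_loss(ce, dice, w_ce=0.5, w_dice=0.5):
    """Weighted sum of the two loss components."""
    return w_ce * ce + w_dice * dice


# Two pixels, two classes (background, building), perfectly predicted.
targets = [[1, 0], [0, 1]]
probs = [[1.0, 0.0], [0.0, 1.0]]
weights = [border_weight(0.0, 0.0), border_weight(20.0, 20.0)]
ce = weighted_cross_entropy([1.0, 1.0], weights)
loss = combined_loss(ce, dice_loss(targets, probs))
```

The weight reaches its maximum w0 in a narrow gap between two buildings (both distances close to zero) and decays quickly as the pixel moves away from the nearest borders, which is what pushes the network to keep touching buildings separated.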

Model Training
In order to reduce overfitting, a data augmentation strategy was employed to diversify the training data. This strategy consists of three forms of data augmentation: horizontal flip, vertical flip and random color jitter (i.e., a random change of the brightness, contrast and saturation of the image). However, because of the large number of images in the training sample, the data augmentation is performed randomly at each iteration with a probability of 0.4 (i.e., on average, ~40% of the training images/masks have a form of data augmentation applied to them at each epoch). The Adam (adaptive moment estimation) optimization algorithm [33] was used with a starting learning rate of 0.0001, the exponential decay rate of the first moment set to 0.9 and that of the second moment set to 0.999. The learning rate was decayed every 25 epochs by a factor of two. The batch size was set to 16 and the number of epochs to 70. The training was performed on a platform with two NVIDIA V100 GPUs. The best epoch (i.e., best model) was selected using the mIoU accuracy metric computed at each epoch on the validation sample. In other words, the selected model has the weights saved from the epoch with the highest validation mIoU.
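The learning rate schedule and the probabilistic augmentation can be sketched as below. This is an illustration under stated assumptions: epochs are counted from zero, one flip is chosen uniformly when augmentation is triggered (the text does not say how the form of augmentation is selected), and the color jitter is omitted for brevity. Images and masks are represented as nested lists, and all names are hypothetical.

```python
import random


def learning_rate(epoch, base_lr=1e-4, step=25, factor=2.0):
    """Step decay: divide the learning rate by `factor` every `step` epochs."""
    return base_lr / factor ** (epoch // step)


def hflip(grid):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in grid]


def vflip(grid):
    """Mirror a 2D grid top to bottom."""
    return grid[::-1]


def maybe_augment(image, mask, rng, p=0.4):
    """With probability p, apply the same random flip to the image and its mask."""
    if rng.random() >= p:
        return image, mask
    flip = rng.choice([hflip, vflip])
    return flip(image), flip(mask)


rng = random.Random(0)
image = [[1, 2], [3, 4]]
mask = [[0, 1], [1, 0]]
augmented_image, augmented_mask = maybe_augment(image, mask, rng)
```

Geometric transformations must be applied identically to the image and to its mask so that the pixel labels stay aligned, whereas a color jitter would only be applied to the image.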

Post-Processing
The final step in the workflow is the postprocessing. The prediction masks obtained by the trained DeepLabv3+ model need to be transformed to become meaningful footprint features that can be input into existing measure identification analysis tools. In this work we follow a post-processing pipeline similar to the one introduced by [34]. This pipeline performs the following operations:

1. Converting probability maps into binary masks: the DeepLabv3+ model provides probability maps as outputs, which are converted into binary masks using 0.5 as a threshold (see an example in Figure 2d).
2. Opening morphological operation: this operation is used to remove small objects from the mask while preserving the size and the shape of the larger building footprints (see an example in Figure 2e). This is done by first eroding the objects in the masks (i.e., removing noise and shrinking objects) and then dilating the eroded objects (i.e., re-expanding them).
3. Polygonization: this is done by detecting the contours of each predicted building footprint, i.e., extracting the curve joining all the pixels having the same color (see an example in Figure 2f).
4. Polygon simplification: this step aims to approximate the curve that forms the extracted contour with another curve with fewer vertices (see an example in Figure 2g). The distance between the two curves is less than or equal to a precision specified by the user. In this work the Douglas-Peucker [35] algorithm was used.
5. Converting the polygons' pixel-based coordinates into geographic coordinates: the result of this step is a GeoJSON GIS file.
6. Polygon merging: in this step, polygons that represent the same building footprint, yet were split due to tile boundaries, are merged to obtain a single building footprint.
7. Increasing the size of each detected polygon (i.e., building footprint): in order to improve the detection of building boundaries that are nearly overlapping, the surface area of each polygon in the training data was decreased by ~8%. Therefore, the model was trained to underestimate footprint size. To compensate, the reverse operation is applied, and the area of each polygon is increased by 8%.
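Three of the steps above (thresholding, Douglas-Peucker simplification, and the conversion from pixel to geographic coordinates) can be sketched in plain Python. This is a simplified illustration and not the AutoBFE implementation: the function names are hypothetical, the threshold is assumed to be inclusive, the simplification works on an open polyline and measures distances to the chord segment, and the coordinate conversion assumes 512 by 512 slippy-map tiles.

```python
import math


def threshold(prob_map, t=0.5):
    """Convert a probability map (nested lists) into a binary mask."""
    return [[1 if p >= t else 0 for p in row] for row in prob_map]


def _point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Simplify a polyline, keeping points farther than tolerance from the chord."""
    if len(points) < 3:
        return list(points)
    index, d_max = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], points[0], points[-1])
        if d > d_max:
            index, d_max = i, d
    if d_max <= tolerance:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right  # the split point appears in both halves


def pixel_to_lonlat(tile_x, tile_y, zoom, px, py, tile_size=512):
    """Convert a pixel position inside a slippy-map tile to (lon, lat) in degrees."""
    n = 2 ** zoom
    fx = tile_x + px / tile_size
    fy = tile_y + py / tile_size
    lon = fx / n * 360.0 - 180.0
    lat = math.degrees(math.atan(math.sinh(math.pi * (1.0 - 2.0 * fy / n))))
    return lon, lat
```

A closed building contour can be simplified with the same routine by splitting it into two open chains at two of its vertices. Libraries such as OpenCV and Shapely provide the opening, contour extraction and simplification operations directly.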

Experiments and Results
The training data was used to train two models. The first model was trained using the loss function defined by Equation (4) in Section 3.2.2 (this model is called dicewce), and the second model used a modified version of this loss function in which the spatial weighting was not applied in $L_{CE}$ (this model is called dicece). The accuracy of these two models is compared using the test sample. Using the model trained with the dicewce loss function, we generated building footprint predictions using satellite/aerial data from a region of the city of Houston, the entire city of Newark and a region of New York City (NY). Note that while neither Newark nor Houston was included in the training process, NY was, which means that 60% of the NY images were used in the training process. The decision to include NY in this analysis is justified by the fact that NY, and especially Manhattan, is very dense and has a very specific type of architecture that poses a real challenge to a semantic segmentation approach for detecting building footprints. The resulting footprints were then compared to the building footprints in the dataset released by Microsoft.

Accuracy Metrics
The semantic segmentation accuracy of the trained models and the Microsoft data is measured using four metrics: mean Intersection over Union (mIoU), F1 score (F1), precision, and recall. These metrics are based on pixel-level predictions, summarized by the total numbers of true positives (TP), false positives (FP) and false negatives (FN), which are computed from the pixels classified as buildings in the predicted masks and in the ground truth masks. Thus:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

To compute the mIoU, IoU is first calculated for each class (building and background) and then averaged. IoU is defined as:

$$IoU = \frac{TP}{TP + FP + FN}$$

Impact of the Spatial Weighting in the Loss Function
Figure 3 shows the convergence of mIoU during the training process; mIoU is the metric used to select the best model, and the depicted values are computed using the validation sample. One can see that the model trained with the dicece loss function converges faster than the one trained with dicewce and has a slightly better mIoU for the best model (the difference between the best dicece and best dicewce models is less than 0.004), which can be explained by the fact that it is harder to learn the boundaries between adjacent buildings. Table 1 shows the building footprint segmentation accuracy metrics of the two trained DeepLabv3+ models (i.e., dicewce and dicece) computed using the test sample. These metrics show no significant difference in terms of pixel-wise semantic segmentation accuracy between the two models. The main difference is that dicewce has a slightly better performance than dicece in terms of precision, while dicece has a slightly better performance in terms of recall. However, a visual analysis of the predicted probability maps makes it clear that using the spatial weighting in the loss function significantly improves the detection of the separation between adjacent buildings. Figure 4 shows an example of these probability maps for a densely built region in San Francisco, CA.
One can see that the gap between adjacent buildings has a significantly lower probability of being classified as a building when the dicewce model is used. Figure 4 also shows that dicewce is more conservative in labeling pixels between adjacent buildings, which explains why it has a lower performance in terms of recall and a better performance in terms of precision in comparison to dicece.
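The pixel-wise metrics used throughout this section (precision, recall, F1, and mIoU averaged over the building and background classes) can be computed from two flattened binary masks as in the following sketch. The function name and the convention of returning 0 when a denominator is zero are assumptions made for this example.

```python
def segmentation_metrics(pred, truth):
    """Pixel-wise metrics for flattened binary masks (1 = building, 0 = background)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2.0 * precision * recall, precision + recall)
    iou_building = ratio(tp, tp + fp + fn)
    # For the background class, the roles of positives and negatives are swapped.
    iou_background = ratio(tn, tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "miou": (iou_building + iou_background) / 2.0,
    }


metrics = segmentation_metrics([1, 1, 0, 0], [1, 0, 1, 0])
```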

Comparison with Microsoft Building Footprint Data
The second analysis used the trained dicewce model to make predictions on imagery from Newark, NJ and from the previously defined regions of NY city and Houston. The resulting probability maps for these three cities were postprocessed following the pipeline described in Section 3.3. The building footprint GIS files produced by step 7 of the postprocessing pipeline were converted into masks and used to estimate accuracy metrics. Similarly, the Microsoft building footprint GIS files for the same three regions were converted into masks and used to estimate the same accuracy metrics.
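As an illustration of the mask-based comparison described above, the following is a minimal sketch of pixel-wise accuracy metrics computed from a predicted mask and a ground truth mask. The function name `mask_metrics` is hypothetical, and the choice of metrics is an assumption: precision and recall are discussed in the text, while F1 and intersection over union (IoU) are included only because they are common for segmentation; the exact metrics of Table 2 are not restated here.

```python
import numpy as np


def mask_metrics(pred, truth):
    """Pixel-wise precision, recall, F1 and IoU for two binary masks.

    Hypothetical helper: `pred` and `truth` are arrays of the same shape
    in which nonzero pixels are labeled as building.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)

    tp = int(np.logical_and(pred, truth).sum())    # building in both masks
    fp = int(np.logical_and(pred, ~truth).sum())   # predicted only
    fn = int(np.logical_and(~pred, truth).sum())   # ground truth only

    # Guard against empty masks to avoid division by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Toy example: a 2 x 4 tile with one false positive and one false negative.
truth = [[1, 1, 0, 0],
         [1, 1, 0, 0]]
pred = [[1, 0, 0, 0],
        [1, 1, 1, 0]]
metrics = mask_metrics(pred, truth)
```

Applying the same function to the masks derived from the proposed approach and to those derived from the Microsoft footprints, against the same ground truth masks, keeps the two sets of metrics directly comparable.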
The accuracy results of the footprints obtained with the dicewce model and with the Microsoft approach are presented in Table 2 for the three regions. The proposed approach achieved better accuracy metrics than Microsoft for NY city and Houston, while the Microsoft footprints are slightly better for Newark. Table 3 compares the number of buildings detected by the proposed approach with the number in the Microsoft footprints. By this measure, the proposed approach significantly outperforms the Microsoft results.
The number of separate buildings detected by our approach is approximately 66% higher for Newark, 29% higher for Houston, and approximately 128% higher for NY city. It is important to note that we performed an extensive visual inspection of the polygons generated by our method and did not observe any serious over-segmentation issue.
Figures 5, 6, and 7 provide three examples comparing the building footprints predicted by the proposed approach with the building footprints provided by Microsoft. In each of these figures, image (a) shows the remote sensing image, image (b) the remote sensing image overlaid with the ground truth mask generated from the gathered GIS files, image (c) the remote sensing image overlaid with the mask generated from the GIS files provided by Microsoft, and image (d) the remote sensing image overlaid with the mask generated from the GIS files produced by the proposed approach (using the deep learning model with the dicewce loss function).
The differences in building counts are driven by the fact that the Microsoft approach does not perform well at detecting the separation between buildings and building courtyards, as shown in Figure 5. In a densely built environment, the Microsoft approach most often detects blocks of buildings rather than separate buildings. Our proposed approach performed much better at this task thanks to the employed loss function. However, our model also failed in some cases to detect the separation between adjacent buildings, as can be seen in Figure 5 and more dramatically in Figure 6, where the building footprints have more complex shapes. Figure 7 shows a less dense area where the Microsoft building footprints have a more realistic shape. This is due to the postprocessing pipeline used by Microsoft, which performs well at producing realistic shapes in areas with lower building density. Figures 6 and 7 show that both approaches fail to accurately predict the boundaries of building footprints.
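The effect of missed separations on the building count can be illustrated with a small sketch. It counts 4-connected groups of building pixels in a binary mask, which is only a stand-in for counting polygons in the GIS files (the counts in Table 3 come from the footprint files, not from this procedure). The names `count_components` and `percent_more` are hypothetical.

```python
from collections import deque

import numpy as np


def count_components(mask):
    """Count 4-connected groups of building pixels in a binary mask."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros(mask.shape, dtype=bool)
    rows, cols = mask.shape
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r, c] or seen[r, c]:
                continue
            # New component found: flood-fill it with a breadth-first search.
            count += 1
            seen[r, c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny, nx] and not seen[ny, nx]):
                        seen[ny, nx] = True
                        queue.append((ny, nx))
    return count


def percent_more(ours, reference):
    """Relative difference in counts, in percent of the reference count."""
    return 100.0 * (ours - reference) / reference


# Two adjacent buildings separated by a one-pixel gap ...
separated = [[1, 1, 0, 1, 1],
             [1, 1, 0, 1, 1]]
# ... and the same buildings with the gap partly filled in.
merged = [[1, 1, 1, 1, 1],
          [1, 1, 0, 1, 1]]

n_separated = count_components(separated)  # 2 buildings
n_merged = count_components(merged)        # 1 block
```

A single mislabeled pixel between two adjacent buildings is enough to merge them into one block, which is why a model that is conservative about labeling the pixels between buildings yields a higher count of separate buildings.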

Conclusions
In this work, we described and evaluated an end-to-end approach for building footprint extraction from remote sensing images, which is based on deep semantic segmentation. The proposed workflow consists of three steps: data preprocessing, modeling, and results postprocessing. The data preprocessing step takes advantage of openly available building footprint GIS data to automatically generate ground truth masks that can be used for training. This addresses one of the most challenging aspects of deep learning based semantic segmentation: generating a sufficient amount of labeled training data. The second step of the workflow is the deep semantic segmentation modeling (i.e., model training). In this work, we applied the state-of-the-art DeepLabv3+ semantic segmentation algorithm. The last step of the workflow is the postprocessing of the semantic segmentation results, which transforms the extracted building footprints into GIS format.
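The mask generation in the preprocessing step can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: it assumes that a footprint polygon has already been converted from geographic coordinates to pixel coordinates of the image tile, and it marks every pixel whose center falls inside the polygon using the even-odd rule. The function name `rasterize_polygon` is hypothetical; in practice, a GIS library would typically handle both the coordinate transformation and the rasterization.

```python
import numpy as np


def rasterize_polygon(vertices, height, width):
    """Binary mask of the pixels whose centers lie inside a polygon.

    `vertices` is a list of (x, y) pixel coordinates (hypothetical input,
    assumed to be already projected onto the image tile).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    px = xs + 0.5  # x coordinate of each pixel center
    py = ys + 0.5  # y coordinate of each pixel center

    inside = np.zeros((height, width), dtype=bool)
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        if y0 == y1:
            continue  # a horizontal edge never crosses a horizontal ray
        # Pixels whose horizontal ray spans the vertical extent of the edge.
        crosses = (y0 > py) != (y1 > py)
        # x position where the edge intersects each pixel row's ray.
        x_at_y = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
        # Even-odd rule: toggle on every edge crossed to the right.
        inside ^= crosses & (px < x_at_y)
    return inside


# A rectangular footprint on a 4 x 6 pixel tile.
footprint = [(1, 1), (4, 1), (4, 3), (1, 3)]
mask = rasterize_polygon(footprint, height=4, width=6)
```

Taking the union of such masks over all footprints that intersect a tile gives a ground truth mask for that tile. Any mismatch between the footprint records and the imagery carries over directly into the mask, which is the source of label noise discussed in this work.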
The developed approach was evaluated using data gathered from 15 cities and five counties, with promising results when compared to an existing yet closed, best-in-class solution from Microsoft. These encouraging results were obtained in spite of using noisy data with imperfect ground truth masks (i.e., ground truth masks that do not exactly match the remote sensing images). These discrepancies were due to a difference in vintage between the imagery and the building footprint records. Additional performance improvements might be gained if these discrepancies can be removed from the training dataset. One of the major weaknesses of the proposed approach is that it fails to predict accurate building boundaries. Unlike actual building footprints, the predicted footprints do not have regular shapes. This is likely because neither the semantic segmentation modeling approach nor the postprocessing steps are constrained by the building geometry. The noisy training data may also affect the accuracy of the predicted shapes.
Although the proposed approach was applied to building footprint extraction, the workflow can be adapted to several other semantic segmentation tasks that are relevant to building energy applications, such as detecting the presence of photovoltaic panels and packaged rooftop heating, ventilation, and air conditioning systems. For these objects, a high accuracy in detecting the shape is usually less important than a decent accuracy in estimating the size, which can be used as a proxy for extracting some energy related characteristics (e.g., energy production for photovoltaic panels). However, it is important to note that, to the best of our knowledge, there are no extensive open sources of GIS data that localize these types of objects. In addition, urban planning applications could benefit from the use of these techniques for road and vegetation detection.
Building upon the work presented in this paper, our future work will focus on improving the pipeline to obtain more realistic building footprints. This may involve improving the postprocessing module, creating a solution to automatically clean the training data (e.g., automatically excluding tiles that have a mismatch between the ground truth mask and the remote sensing image), or developing deep learning models that are more robust to noise. Another direction of our future research will be to explore the use of Lidar data [36][37][38] in combination with remote sensing RGB images to improve building shape detection and, at the same time, provide estimates of building heights.