Comparing Three Machine Learning Techniques for Building Extraction from a Digital Surface Model

Abstract: Automatic building extraction from high-resolution remotely sensed data is a major area of interest for an extensive range of fields (e.g., urban planning, environmental risk management) but is challenging due to the complexity of urban morphology. Among the different methods proposed, the approaches based on supervised machine learning (ML) achieve the best results. This paper investigates building footprint extraction using only high-resolution raster digital surface model (DSM) data by comparing the performance of three popular supervised ML models on a benchmark dataset. The first two methods rely on a histogram of oriented gradients (HOG) feature descriptor and a classical ML (support vector machine (SVM)) or a shallow neural network (extreme learning machine (ELM)) classifier, and the third model is a fully convolutional network (FCN) based on deep learning with transfer learning. The data were obtained from the International Society for Photogrammetry and Remote Sensing (ISPRS) and cover the urban areas of Vaihingen an der Enz, Potsdam, and Toronto. The results indicated that the performance of models based on shallow ML (feature extraction and classifier training) is affected by the urban context investigated (F1 scores from 0.49 to 0.81), whereas the FCN-based model proved to be the most robust and best-performing method for building extraction from a high-resolution raster DSM (F1 scores from 0.80 to 0.86).


Introduction
In recent years, the availability of high-spatial-resolution remote sensing data has fostered the development of research methods and applications in several fields [1], such as urban planning, land monitoring, and environmental risk management [2][3][4][5] (e.g., urbanization delineation, vegetation mapping, flood modeling). The accurate information provided by the high-resolution data has been exploited both in large-scale problems, such as land use and land cover types [6,7], and small-scale ones, such as the extraction of urban objects: trees, roads, buildings, etc. [8][9][10]. Laser imaging detection and ranging (LiDAR) airborne mapping systems supply point cloud datasets containing 3-dimensional x, y, and z points and attributes, from which precise digital terrain model (DTM) and digital surface model (DSM) products (in a gridded or raster data format) can be produced in both natural and manmade environments [11].
Automatic building extraction from remotely sensed data is a major area of interest for an extensive range of applications but is challenging because the complexity of urban morphology makes precise boundaries difficult to extract [12]. Different methods have been proposed to address this issue [13], such as methods based on template matching, knowledge, object-based image analysis, and machine learning (ML) [14]. Commonly used template-matching-based approaches for automated building and urban object footprint detection use "snake" or active contour model (ACM) [15] improvements [10,16-18] and integration with ML and deep learning (DL) [19]. The approaches based on supervised ML can achieve the best results, especially along with DL [14], a subset of ML based on neural networks with representation learning [20]. Traditional computer vision techniques require manually engineered feature descriptors for the desired object detection, whereas DL models automate the process of feature engineering [21,22]. Generally, classical ML-based object detection methods involve two steps: feature extraction and classifier training. Among feature descriptors, histogram of oriented gradients (HOG) [23] features have proven to be effective in describing the edge or local shape information of urban objects [14,24-27]. Typical ML algorithms for classifier training include, but are not limited to, the support vector machine (SVM) [7,28-30], artificial neural networks (ANNs) [31,32], including the extreme learning machine (ELM) [33], and AdaBoost. Several successful building and urban object detection approaches are based on DL methods [10,34-37], such as convolutional networks; in particular, fully convolutional networks (FCNs) have shown good performance on semantic segmentation [38][39][40][41][42], with correct pixel classification and accurate spatial information [43].
Despite the numerous algorithms proposed, building segmentation based only on the geometric information provided by DSM data is still a complicated task [44,45], mainly because objects with similar morphological characteristics and height can create ambiguity, resulting in position inaccuracy and local under-sampling [35,45].
A simplified method to detect and extract the building footprint based only on a DSM as input, without any additional feature, could be beneficial in scenarios where only DSM data are available and a more expeditious solution is desirable (e.g., the rapid assessment of building damage). This paper aims to investigate the capability of three different, popular supervised ML models (namely HOG with SVM, HOG with ELM, and FCN) to detect building footprints using only raster DSM data as input and to evaluate their performance on a publicly available benchmark dataset. This empirical comparison may highlight the potential usefulness of expeditious methods and help understand which model performs best in different urban environments. The first two methods rely on a HOG feature descriptor and a classical ML classifier (SVM) or a shallow neural network classifier (ELM) [46][47][48], and the FCN model is based on DL [49,50]. The descriptor and classification models were chosen for their accuracy [14,51], especially in segmentation tasks [52], popularity [14], simplicity, reduced computational time, and the ability to extract building footprint masks without reducing the resolution of raster data.
Figure 1 shows the general workflow of our study. The core steps included data retrieval and preparation, model implementation and training, and test performance evaluation. All the algorithms were implemented in the MATLAB environment, except the FCN, which was implemented in the Python programming language with the Torch package. The experiments were conducted on a Fujitsu Celsius Workstation with an Intel Xeon E5-2643 CPU (3.30 GHz) and 16 GB RAM. The following subsections describe each step.
Figure 1. Overview of the study workflow steps: data retrieval and preparation (labeling, resizing, and train/test split), model implementation and training (building extraction), and test performance evaluation.

Dataset Description
The data used in this study were obtained from the International Society for Photogrammetry and Remote Sensing (ISPRS) "Test Project on Urban Classification, 3-D Building Reconstruction, and Semantic Labeling". The sets consisting of airborne images and laser scanner data were made publicly available to evaluate and compare different urban object extraction methods, providing benchmark datasets with ground truth [13,53], more updated and complete than the existing ones [54][55][56].
For the sake of generalizability, the available datasets were chosen covering three urban areas that differ in terms of urban morphology.
The first part of the dataset was captured over the city of Vaihingen an der Enz (Germany) and originally obtained from the digital aerial camera tests carried out by the German Association of Photogrammetry and Remote Sensing (Deutsche Gesellschaft für Photogrammetrie, Fernerkundung und Geoinformation, DGPF) [57]. Each of the 33 different-sized patches (Figure 2) of the Vaihingen dataset contained a labeled true orthophoto (TOP) paired with a DSM [58] with a ground sampling distance of 9 cm. The publishers of the dataset provided a train/test split (30 patches for training, outlined in red, and 3 for testing, outlined in green). The morphology of Vaihingen presents the characteristics of a relatively small Central European village with many detached buildings and small multi-story buildings. It encompasses slightly different urban fabrics: relatively dense patterns formed by historic buildings with complex shapes and sparse trees, loose patterns formed by a few high-rise residential buildings surrounded by trees, and regular patterns formed by small detached houses.
The second part of the dataset was captured over the city of Potsdam (Germany) and was originally obtained from the DGPF tests [58]. The Potsdam dataset consisted of 38 patches containing a labeled TOP paired with a DSM with a 6000 × 6000 pixel size and a ground sampling distance of 5 cm. The provided train/test split for the Potsdam dataset (Figure 3) consisted of 35 patches for training (outlined in red) and 3 for testing (outlined in green). The morphology of Potsdam represents a typical Central European historic city with large building blocks, narrow streets, and a dense settlement structure.
The third part of the dataset covered the city of Toronto (Canada) and consisted of 3 different-sized patches containing a TOP and a DSM interpolated from the airborne laser scanner point cloud with a grid width of 25 cm. The given train/test split ratio for the Toronto dataset (Figure 4) was 2:1. The urban morphology of downtown Toronto exhibits the characteristics of a typical modern North American megacity, such as the presence of different urban objects and urban fabrics formed by a mixture of low- and high-story buildings with various degrees of shape complexity in rooftop structures and streets, including clusters of high-rise buildings that cast fairly large shadows [59].
The final database retrieved from the different ISPRS challenges was representative of the typical Central European and North American cities with respect to urban complexity, dimensions, and fabric density, especially suitable for method evaluation and comparison. The supervised learning models require a labeled set of data to learn, during the training process, the underlying patterns that can be used to make predictions on novel data.
To solve the building detection task, the building footprints were considered as a group of pixels that can be distinguished from the pixels representing any other objects that may appear in an urban environment (roads, paved areas, vegetation, etc.) through discriminatory features extracted from the DSM raster images. Thus, the input to the models and the test data were lists of DSM tiles and the corresponding binary maps (Figure 5) containing the ground truth for two semantic classes: building and non-building. For the Vaihingen and Potsdam sets, the binary masks were generated by selecting the building category from the provided ground truth (8-bit RGB tif files with one color per land cover class). For the Toronto set, the 8-bit binary masks were obtained by rasterizing the shapefiles describing the building outlines. The preprocessing phase consisted of rescaling the raster data by an adequate factor using nearest-neighbor interpolation (Figure 1). The initial mismatching resolutions (9 cm/pixel for the Vaihingen set, 5 cm/pixel for the Potsdam set, and 25 cm/pixel for the Toronto set) were brought to a single value of 36 cm/pixel to improve computational speed and performance. This value was obtained by progressively down-scaling a sample until a marked reduction in computational time was seen without significant loss of information. The adjusted set of DSM tiles and masks was binned into training and test subsets.
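The preparation step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' MATLAB pipeline: the function names are hypothetical, the building class is assumed to be encoded as pure blue in the RGB label files, and the rescaling factor is taken as the ratio between the source resolution and the 36 cm/pixel target.

```python
import numpy as np

# Assumption: the RGB label rasters encode the building class as pure blue.
BUILDING_RGB = (0, 0, 255)
TARGET_GSD_CM = 36.0  # common resolution stated in the text


def building_mask(label_rgb, color=BUILDING_RGB):
    """Return a uint8 mask: 1 where the RGB label equals `color`, else 0."""
    return np.all(label_rgb == np.asarray(color), axis=-1).astype(np.uint8)


def resample_nearest(raster, src_gsd_cm, dst_gsd_cm=TARGET_GSD_CM):
    """Rescale an H x W (or H x W x C) raster with nearest-neighbor sampling."""
    scale = src_gsd_cm / dst_gsd_cm  # e.g. 9 / 36 = 0.25 for Vaihingen
    out_h = max(1, int(round(raster.shape[0] * scale)))
    out_w = max(1, int(round(raster.shape[1] * scale)))
    # Map the center of each output pixel back to a source row/column index.
    rows = np.minimum(((np.arange(out_h) + 0.5) / scale).astype(int),
                      raster.shape[0] - 1)
    cols = np.minimum(((np.arange(out_w) + 0.5) / scale).astype(int),
                      raster.shape[1] - 1)
    return raster[np.ix_(rows, cols)]


# Toy 8 x 8 tile at 9 cm/pixel with one "building" block in the top-left corner.
dsm = np.arange(64, dtype=float).reshape(8, 8)
labels = np.zeros((8, 8, 3), dtype=np.uint8)
labels[:4, :4] = BUILDING_RGB

dsm_36 = resample_nearest(dsm, src_gsd_cm=9.0)
mask_36 = resample_nearest(building_mask(labels), src_gsd_cm=9.0)
```

Nearest-neighbor sampling keeps the mask strictly binary, which is presumably why it suits label rasters; the same function is applied to the DSM here only to keep tile and mask aligned.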
Despite representing different and complex urban morphologies, the generated dataset showed a balanced class distribution. The number of pixels belonging to the building class and non-building class was close to the ideal 1:1 ratio (percentage difference << 0.5%) in every tile. Furthermore, the quality of the original datasets ensured the absence of uncertainty in labeling due to noise, low resolution, no-data pixels, inaccurate object edges, or class overlapping. Thus, the possible biases for the building extraction models were minimized, safeguarding the fairness of the test evaluation and the potential transferability of the learned knowledge.

Shallow ML-Based Building Detection
A shallow ML-based model that can classify pixels as building or non-building requires discriminative and computable engineered features for image segmentation and pixels classification.
Among the existing edge- and gradient-based descriptors, the HOG descriptors appeared especially suitable for detecting the construction footprints, as HOG detectors cue mainly on contours, can be computed quickly, and are fairly invariant to geometric transformations and occlusions [23].
The HOG feature extraction chain computes occurrences of gradient orientation in the detection window, or the region of interest (ROI) across the input, on a regular cell grid and uses overlapping local contrast normalization to enhance the accuracy. For every pixel in the input, the gradient vector contains information on pixel value changes in the x ($g_x$) and y ($g_y$) directions with respect to its four neighbors.
The attributes of the gradient are its magnitude
$$M(x, y) = \sqrt{g_x^2 + g_y^2} \qquad (1)$$
and its direction
$$\theta = \arctan\left(\frac{g_y}{g_x}\right).$$
Gradient information is then pooled into a 1-D histogram of orientations that can be used as input for ML algorithms.
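As a rough illustration of the gradient and pooling steps, the sketch below computes centered differences, the magnitude and direction defined above, and a magnitude-weighted orientation histogram for a single cell. The choices of 9 unsigned orientation bins over 0 to 180 degrees and of hard bin assignment are assumptions made for brevity; block-wise contrast normalization is omitted, and the function names are hypothetical.

```python
import numpy as np


def central_gradients(img):
    """Centered differences over the four neighbors; borders are left at 0."""
    img = img.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # right minus left neighbor
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # lower minus upper neighbor
    return gx, gy


def orientation_histogram(gx, gy, n_bins=9):
    """Pool one cell into a 1-D histogram of orientations weighted by magnitude."""
    magnitude = np.hypot(gx, gy)                        # sqrt(gx^2 + gy^2)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0      # unsigned direction
    bin_width = 180.0 / n_bins
    bins = np.minimum((angle / bin_width).astype(int), n_bins - 1)
    return np.bincount(bins.ravel(), weights=magnitude.ravel(),
                       minlength=n_bins)


# A horizontal height ramp: every interior gradient points along x (bin 0).
ramp = np.tile(np.arange(8.0), (8, 1))
gx, gy = central_gradients(ramp)
hist = orientation_histogram(gx, gy)
```

For the ramp, all gradient energy falls into the first bin; a full descriptor would concatenate such histograms over the cell grid after normalizing them within overlapping blocks.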
Thus, the feature extractor encoded the raster inputs into feature vectors to feed a classifier for building/non-building instance detection, namely the ELM and SVM classifiers.
The first method for building detection under examination used HOG as an input to the SVM [60], a supervised ML algorithm that produces accurate classifications of remotely sensed data [28,[61][62][63]. The binary SVM classifier models a data point as a multidimensional vector belonging to one of two classes and constructs a hyperplane or set of hyperplanes to separate the classes [64] with the maximal margin (a space containing no observations) and the lowest misclassification [65][66][67].
The SVM classifier was fitted on the aforementioned processed training set to find the best-separating hyperplane (i.e., the decision boundary) using as input the predictor vectors $x_j$ along with their class labels $y_j \in \{+1, -1\}$. During the training process, sequential minimal optimization [68] solved the constrained optimization problem by breaking it into a series of sub-problems that could be solved analytically. The final model was then saved to perform prediction on the independent test set in the evaluation phase.
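To make the idea of a maximal-margin decision boundary concrete, the following sketch trains a linear soft-margin SVM on toy feature vectors. Note that it does not use sequential minimal optimization: for brevity it minimizes the regularized hinge loss by batch subgradient descent, which is a different solver for a closely related problem. The hyperparameters (`lam`, `lr`, `epochs`), the function names, and the toy data are all assumptions for illustration.

```python
import numpy as np


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean(max(0, 1 - y * (X @ w + b)))."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0  # samples inside the margin or misclassified
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(X, w, b):
    """Assign +1 (building) or -1 (non-building) by the side of the hyperplane."""
    return np.where(X @ w + b >= 0.0, 1, -1)


# Toy stand-ins for HOG feature vectors with labels in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])
w, b = train_linear_svm(X, y)
train_pred = predict(X, w, b)
```

Only the samples with margin below 1 contribute to each update, which mirrors the role of support vectors in the dual formulation that SMO solves.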
The second method combines the HOG and an ELM [47] classifier instead of the SVM [46]. The ELM trains a shallow feedforward neural network with a single hidden layer, i.e., the feature mapping, which does not need parameter tuning [69][70][71]. The main advantages of the ELM method compared to the previous one include better scalability and similar generalization at a faster learning speed [72].
The ELM classifier was fitted on the same training subset with a single hidden layer of 1000 neurons and a training ratio value of 0.9 to find the combination of nodes, weights, and biases minimizing the error between the actual output of the network (predictions) and the expected one (the ground truth) and obtain a learned model to be used for the test evaluation. The net started with random input weights and calculated the best values using the root-mean-square error to assess the prediction accuracy.
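A minimal ELM can be sketched as follows, assuming a tanh activation and a pseudoinverse solution for the output weights, neither of which is specified in the text. The input weights and biases are drawn at random and never updated; only the output weights are fitted by least squares. The default of 1000 hidden neurons follows the text, while the toy data, the smaller hidden layer used in the demonstration, and all names are illustrative.

```python
import numpy as np


def train_elm(X, y, n_hidden=1000, seed=0):
    """Fit a single-hidden-layer ELM: random hidden layer, least-squares output."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random, never updated
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)                           # hidden-layer feature mapping
    beta = np.linalg.pinv(H) @ y                     # output weights
    return W, b, beta


def elm_scores(X, model):
    """Return the raw network output; its sign gives the predicted class."""
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta


# Toy stand-ins for HOG feature vectors with labels in {+1, -1}.
X = np.array([[0.2, 0.1], [0.9, 0.8], [0.5, 0.7],
              [-0.3, -0.2], [-0.8, -0.9], [-0.6, -0.4]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

model = train_elm(X, y, n_hidden=50)
scores = elm_scores(X, model)
rmse = float(np.sqrt(np.mean((scores - y) ** 2)))  # training error
```

Because training reduces to one linear solve, no iterative tuning of the hidden layer is needed, which is consistent with the speed advantage mentioned above.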

DL-Based Building Detection
The third building detection model was based on a DL architecture. The FCN-based semantic segmentation [50] classifies the building or non-building class for each pixel directly within the image inputs. The FCN gives a pixel-wise output (label map) without needing a hand-engineered feature vector to feed the classifier for building extraction of remotely sensed data [7,42,[73][74][75].
The state-of-the-art classification convolutional model VGG-16 [76,77] structure was repurposed for the segmentation task employing the method fully described in [50]: adapting the original architecture into an FCN and transferring its learned representations by fine-tuning to the desired task. The repurposed architecture combines the semantic global information of deep, coarse layers to learn the feature hierarchy and the local knowledge of shallow, fine layers with 32-, 16-, and 8-pixel strides to improve the segmentation accuracy and enable pixel-wise predictions.
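The fusion of coarse and fine predictions can be illustrated with a shape-level sketch. It assumes per-class score maps are already available at strides 32, 16, and 8, and it substitutes nearest-neighbor upsampling for the learned upsampling layers of the method in [50]; no convolutional weights or VGG-16 layers are modeled, and the function names are hypothetical.

```python
import numpy as np


def upsample(scores, factor):
    """Nearest-neighbor upsampling of a (classes, H, W) score map."""
    return scores.repeat(factor, axis=1).repeat(factor, axis=2)


def fuse_skip_scores(score_s32, score_s16, score_s8):
    """Combine coarse-to-fine class scores and return a pixel-wise label map."""
    fused_s16 = upsample(score_s32, 2) + score_s16  # stride 32 -> 16
    fused_s8 = upsample(fused_s16, 2) + score_s8    # stride 16 -> 8
    full = upsample(fused_s8, 8)                    # stride 8 -> input size
    return full.argmax(axis=0)                      # class index per pixel


# Random scores for two classes (non-building, building) on a 64 x 64 input.
rng = np.random.default_rng(0)
h = w = 64
s32 = rng.standard_normal((2, h // 32, w // 32))
s16 = rng.standard_normal((2, h // 16, w // 16))
s8 = rng.standard_normal((2, h // 8, w // 8))
label_map = fuse_skip_scores(s32, s16, s8)
```

The deep stride-32 scores carry the global context, while the stride-16 and stride-8 additions restore spatial detail before the final upsampling to the input resolution.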
We fit the three models on a common portion of the dataset (the training set) by feature extraction and classifier training for the shallow ML-based models and fine-tuning through transfer learning for the DL model and then evaluated their discriminatory ability on the untouched data left (the test set).

Test Evaluation
The three implemented methods were fitted on a common dataset and then used to generate predictions on the same hold-out test set to obtain an unbiased estimate of each model's accuracy.
Given the building label as the positive class and non-building label as the negative class assigned to every single pixel of the raster images, the true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs) were counted to obtain the pixel-wise metrics for the evaluation of the binary classifiers [78,79] and a contingency map from a pixel-to-pixel comparison [32,80,81].
The TPs were the pixels correctly identified as building pixels; the FPs were nonbuilding pixels wrongly labeled as belonging to the building class. Similarly, the TNs were non-building pixels correctly classified, whereas the FNs were building pixels wrongly classified as non-building pixels (undetected building pixels). Totaling the number of TPs, TNs, FPs, and FNs added up to the total pixels of the test set.
Sensitivity, or recall, is the proportion of TPs among the pixels that actually are positive (building class). Sensitivity takes values in the range (0, 1); with higher sensitivity, fewer building pixels are undetected (larger footprints).
$$\text{Sensitivity or Recall} = \frac{TP}{TP + FN}$$
Specificity is the proportion of TNs among the pixels that actually are negative (non-building class). Specificity takes values in the range (0, 1); higher specificity leads to fewer pixels mislabeled as building.
The relationship between sensitivity and specificity was visualized using the receiver operating characteristic (ROC) curve and the area under the curve (AUC) [82] to quantify the performance of each classifier over its range of possible cut-offs (classification thresholds).
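One simple way to obtain such a curve is to sort the pixels by classifier score and accumulate true and false positives as the threshold is lowered. The sketch below does this and integrates the curve with the trapezoidal rule; it assumes that no two scores are tied and that both classes are present, and the names and toy values are illustrative.

```python
import numpy as np


def roc_curve_points(labels, scores):
    """Return (fpr, tpr) arrays obtained by lowering the threshold step by step."""
    order = np.argsort(-scores, kind="stable")  # highest score first
    sorted_labels = labels[order]
    tps = np.cumsum(sorted_labels == 1)         # true positives at each cut-off
    fps = np.cumsum(sorted_labels == 0)         # false positives at each cut-off
    tpr = np.concatenate(([0.0], tps / tps[-1]))  # sensitivity
    fpr = np.concatenate(([0.0], fps / fps[-1]))  # 1 - specificity
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Four pixels: 1 = building, 0 = non-building, with classifier output scores.
labels = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr = roc_curve_points(labels, scores)
auc = auc_trapezoid(fpr, tpr)
```

In this toy case one building pixel scores below one non-building pixel, so three of the four building/non-building pairs are ranked correctly and the area is 0.75.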
Precision, or positive predictive value (PPV), is the proportion of relevant pixels (TPs) among all the pixels classified as building pixels (positive) in the test. The PPV varies from 0 to 1, corresponding respectively to the worst and the best classifier.
Similarly, the negative predictive value (NPV) represents the proportion of pixels with accurate non-building test results (TNs) among all the pixels identified as non-building pixels (negative class). NPV reference values are 0 for the worst classification and 1 for the best classification possible.
The F1 score combines the precision and recall values by taking their harmonic mean. The F1 score takes values in the range (0, 1); higher F1 values correspond to better model performance.
The mean-square error (MSE) evaluates the mean of the quadratic prediction errors; lower values indicate better performance.
The Matthews correlation coefficient (MCC) effectively and reliably measures the quality of binary classifications [83], even if the classes are unbalanced, by taking into account true and false positives and negatives. MCC values range from the worst value −1 to the best value +1.
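The pixel-wise metrics described in this section follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; it assumes hard 0/1 masks (so the MSE reduces to the fraction of misclassified pixels) and that every denominator is non-zero, i.e., both classes occur in the ground truth and in the prediction. The function name and toy counts are illustrative.

```python
import math

import numpy as np


def pixel_metrics(y_true, y_pred):
    """Compute pixel-wise binary metrics from ground-truth and predicted masks."""
    t = np.asarray(y_true).ravel().astype(bool)
    p = np.asarray(y_pred).ravel().astype(bool)
    tp = int(np.sum(t & p))     # building pixels detected
    tn = int(np.sum(~t & ~p))   # non-building pixels correctly rejected
    fp = int(np.sum(~t & p))    # non-building pixels labeled as building
    fn = int(np.sum(t & ~p))    # undetected building pixels
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "npv": tn / (tn + fn),
        "f1": 2 * precision * recall / (precision + recall),
        "mse": (fp + fn) / (tp + tn + fp + fn),
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }


# Toy test set: 45 building and 55 non-building pixels (TP=40, FN=5, FP=10, TN=45).
y_true = [1] * 45 + [0] * 55
y_pred = [1] * 40 + [0] * 5 + [1] * 10 + [0] * 45
m = pixel_metrics(y_true, y_pred)
```

In this example the F1 score is about 0.84 while the MCC is about 0.70, which illustrates why the two metrics can rank classifiers differently: the MCC also accounts for the true negatives.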

Results
We reported pixel-based metrics to evaluate the prediction error for each of the presented models for the building detection task in different urban areas. Figure 6 shows the contingency maps resulting from the pixel-wise results of the binary classification with the SVM, ELM, and FCN superimposed on the aerial images of the considered areas: the building pixels correctly detected (TPs) are colored in yellow, FPs are colored in red, and FNs are colored in blue.
As seen from the examples in Figure 6, the ELM tended to underestimate the pixels belonging to the building class, since a fairly high rate of FNs was produced. The resulting segmentation images were not close to reality in the Vaihingen and Toronto patches: in the first case, the smaller footprints were undetected; in the second case, there were "holes" in some footprints, having building pixels surrounding non-building areas, which is a characteristic of urban fabrics formed by specific architectural typologies (e.g., courtyard, siheyuan, and patio buildings). The SVM, by contrast, seemed more prone in the Vaihingen case to overestimating the number of building pixels due to the higher occurrence of FPs. The contingency maps produced by the FCN model showed a better correspondence of the predicted footprint position and size to the ground-truth masks in all subsets.
For quantitative evaluation of the different classifiers, we reported pixel-wise classification accuracy (Table 1) in terms of sensitivity, specificity, precision, the NPV, F1 scores, the MSE, the MCC, and the AUC using both the aggregate metrics and the metrics split by area. Considering the Vaihingen test set, the MCC values of the HOG-SVM and HOG-ELM models (0.657 and 0.513, respectively) were significantly lower than the value of the FCN model (0.833); compared with the ELM classifier, the SVM was more prone to detecting building pixels, as it showed higher sensitivity (77.3% vs. 41.5%), but its precision was lower (44.6% vs. 60.8%). The metrics relative to non-building pixels were not found to differ markedly. Considering the Potsdam test set, the HOG-SVM and HOG-ELM models showed similarly better outcomes in terms of sensitivity but were slightly outperformed by the FCN model; the difference became greater considering the MCC, a metric that provides a balanced measure of the relationship between reality and prediction. Considering the Toronto test set, however, the MCC values of the HOG-SVM and FCN models were significantly higher than the value of the HOG-ELM model (0.826 and 0.821 vs. 0.664), which suffered from lower sensitivity and NPV due to the presence of small clusters of FN pixels inside the clusters of pixels representing the building footprints, as shown in Figure 6.
Evaluating the aggregated metrics, the best values were achieved by the FCN-based classifier, which exhibited good overall detection reliability. This approach produced good-quality results in the different urban contexts considered, as demonstrated by the values of the F1 score (0.831), AUC (0.960), and MCC (0.834), which are fairly close to the ideal value of 1.
Figures 7 and 8 illustrate the detection abilities of the three binary classifiers by comparing ROC curves calculated on the output scores. The ROC curves of the SVM-based (blue lines), ELM-based (red lines), and FCN-based (orange line) classifiers plot the dependency of the true-positive rate (sensitivity) on the false-positive rate (1 − specificity) obtained at various thresholds.
Globally, all the final models could successfully detect the building footprints within the study areas using only DSM patches as input, presenting good predictive ability with AUC > 0.8 [84].

Discussion
The study compared the performance of three common supervised ML techniques (HOG with SVM, HOG with ELM, and FCN) for the task of building footprint extraction from high-resolution DSM data on a benchmark dataset of DSMs of three different urban contexts (Vaihingen an der Enz, Potsdam, and Toronto). Our results showed that all the ML techniques could successfully complete the task, but the FCN appeared more robust to urban fabric diversity.
The performance of both HOG-based models was influenced by the urban context investigated and in particular decreased within the Vaihingen area. This area is more challenging due to the complexity of the urban fabrics, which are composed of many small buildings varying in shape, sparse low vegetation and trees, and irregular narrow road networks. The improvements in the Potsdam area could be attributed to the larger footprints of the constructions and the relatively reduced number of trees. The worse generalization ability of the two methods based on classical feature learning and classifier training may be caused by the size of the training dataset [72,85], confirming the worse performance of the ELM classifier compared to the SVM on small datasets [72].
In summary, these results show that a DL approach based on FCNs is the most preferable method as it achieves good classification regardless of the urban context, despite the inaccuracy in contours and boundaries that is a drawback of DL-based segmentation [86], and support evidence from previous studies [35,36,38,39,41,85]. Furthermore, such an approach automates the feature engineering and benefits from transfer learning that limits the impacts of the training data size.
However, because the dataset, although sufficient for the comparison, is limited in size, caution must be applied, as the findings might not be fully applicable to large datasets or to urban scenarios with substantially different morphological structures, e.g., Asian cities [87]. Table 2 summarizes the main advantages and disadvantages of the three methods.

Conclusions
In this paper, we performed an empirical comparison of three different supervised ML-based building detection methods (HOG-SVM, HOG-ELM, and FCN models) on a benchmark high-resolution remotely sensed dataset using only raster DSMs. Two of these methods belong to the classical feature learning and classifier training category (shallow ML), whereas the third is a DL network. We used HOG as a feature descriptor and trained the classifiers (SVM and ELM) for the first two methods using publicly available ISPRS datasets. The same data were employed for fine-tuning the FCN architecture through transfer learning. The high quality of the publicly available data resolved the major problem of correctly labeling the training data, on which the predictive skills of the final trained models depend. The methods are easy to implement, as analogous functions are widely accessible in both proprietary and open source software and programming languages (e.g., MATLAB, Python, R).
Our results demonstrate that determining the footprint of buildings from remotely sensed data can produce different results, depending not only on the urban morphology of the context to be surveyed but also on the model choice.
The performance of building detection techniques based on shallow ML was affected by the complexity of the urban context considered, in particular by the presence of vegetation and smaller footprints. The FCN-based model proved to be the most robust and best-performing method for building extraction from high-resolution DSM data. Furthermore, this DL technique can generate accurate building masks without any manually engineered features and with high transferability potential. The model therefore has the potential to solve similar pixel classification tasks, such as extraction of a different class (e.g., ground surfaces, vegetation, cars) or multi-class segmentation, by being re-trained with adequate ground-truth masks and number of classes. Using solely a DSM as input data increases portability between urban areas, as such data are widely available, continuously improved, and constantly released [88].
Future work could investigate the potentials in the aforementioned multi-class problem domains, i.e., multi-class semantic segmentation of urban areas, as well as a systematic analysis of the impacts on the accuracy of the raster resolution variation for applications in data-poor environments or at a larger scale. To better explore the robustness to urban fabric heterogeneity and geographic transferability, future experiments may also include the application in different kinds of settlements (e.g., informal, regional, and vernacular settlements) that can present different morphological characteristics. A metrics ensemble based on both raster resolution and urban morphology could also lead to a more complete characterization of the advantages and disadvantages of each algorithm. The knowledge of the urban context, analysis of urban objects, color information from orthophotos [34], and additional data could be used to develop novel methods for classification, pre-processing the inputs, or post-processing the outputs. Thus, studies on combinations of different approaches and data to improve the FCN detection accuracy would be worthwhile.

Data Availability Statement:
The datasets used were created by the ISPRS in the "Test Project on Urban Classification, 3-D Building Reconstruction, and Semantic Labeling" and can be requested at https://www2.isprs.org/commissions/comm2/wg4/benchmark/ (accessed on 8 October 2019).

Conflicts of Interest:
The authors declare no conflict of interest.