Super-Resolution-based Snake Model -- An Unsupervised Method for Large-Scale Building Extraction using Airborne LiDAR Data and Optical Image

Automatic extraction of buildings in urban and residential scenes has become a subject of growing interest in the domain of photogrammetry and remote sensing, particularly since mid-1990s. Active contour model, colloquially known as snake model, has been studied to extract buildings from aerial and satellite imagery. However, this task is still very challenging due to the complexity of building size, shape, and its surrounding environment. This complexity leads to a major obstacle for carrying out a reliable large-scale building extraction, since the involved prior information and assumptions on building such as shape, size, and color cannot be generalized over large areas. This paper presents an efficient snake model to overcome such challenge, called Super-Resolution-based Snake Model (SRSM). The SRSM operates on high-resolution LiDAR-based elevation images -- called z-images -- generated by a super-resolution process applied to LiDAR data. The involved balloon force model is also improved to shrink or inflate adaptively, instead of inflating the snake continuously. This method is applicable for a large scale such as city scale and even larger, while having a high level of automation and not requiring any prior knowledge nor training data from the urban scenes (hence unsupervised). It achieves high overall accuracy when tested on various datasets. For instance, the proposed SRSM yields an average area-based Quality of 86.57% and object-based Quality of 81.60% on the ISPRS Vaihingen benchmark datasets. Compared to other methods using this benchmark dataset, this level of accuracy is highly desirable even for a supervised method. Similarly desirable outcomes are obtained when carrying out the proposed SRSM on the whole City of Quebec (total area of 656 km2), yielding an area-based Quality of 62.37% and an object-based Quality of 63.21%.


Motivation
Automatic and accurate extraction of building footprints from urban scenes using remote sensing data has become a subject of growing interest for a wide range of applications, such as urban planning [1], city digital twin construction [2], census studies [3], disaster and crisis management, namely earthquake and flood [4,5]. This research work addresses an effective solution for extracting buildings from urban and residential environments in a large scale. Such task plays an important role in the context of flood risk anticipation, which is asserted with a particular importance in the Quebec province, Canada [6]. Such context requires accurate and regularly updated building location and boundary, which enable the extraction of further essential structural and occupational characteristics of buildings (e.g. first floor, basement openings). In addition, the scalability of this solution-i.e. the ability to maintain its effectiveness when expanding from a local area to a large area [7]-is crucially important considering the scale of the study (i.e. at the scale of the Quebec province).
The nature of urban and residential environments can be very complex, where buildings can be found with various sizes, colors and shapes, within urban areas of different density and vegetation coverage. Such complexity is problematic for developing a large-scale building extraction solution. Indeed, a number of studies have been reported over the years with relatively significant results by assuming building shapes [8][9][10], enforcing geometrical constraints [11], or limiting on specific urban areas. However, such assumptions and constraints limit the scalability of the building extraction method, in particular over large areas composed of numerous and complex structures. Based on these premises, it is necessary that such solution is (i) versatile-applicable on different urban scenes without relying on predefined assumptions, constraints, or prior knowledge about the involved scenes and buildings; (ii) highly accurate; (iii) and easily scalable over large areas with a relative computational simplicity. To the best of our knowledge, such a solution has not yet been found.

Literature review
A large number of building extraction methods have been reported over the last few decades, particularly with the emergence of Light Detection and Ranging (LiDAR) systems since mid-1990s [12]. However, this task remains very challenging due to various difficulties. For instance, many works [13][14][15][16] have been carried out using aerial and satellite imagery. They face many problems due to occlusions, poor contrasts, shadows, and disadvantageous image perspectives [17]. Since height changes allow distinguishing urban objects more effectively than the spectral and textural changes from optical images, numerous works [18,19] proposed to exploit 3-D information from LiDAR to extract buildings. However, these methods usually face problems of misclassification of vegetation as buildings [20]. In addition, the accuracy of extracted boundaries can be compromised due to the LiDAR point cloud sparsity [21]. Therefore, many researchers have developed a consensus strategy to use multi-source data in order to increase the building detection rate. Hence, a number of studies [22,23] focusing on the integration of LiDAR and optical imagery data have been reported. They succeed at improving the building extraction accuracy, compared to the use of individual data source [24]. However, such approach of integrating multi-source data can be problematic due to data misalignment [25].
The International Society for Photogrammetry and Remote Sensing (ISPRS) Working Group II/4 "3D Scene Reconstruction and Analysis" provided a taxonomy for methods submitted to the urban object detection benchmark test [26], based on their processing strategy. Some of the methods are categorized as supervised methods requiring training data from LiDAR point cloud or optical image, such as Niemeyer et al. [27] and Chai [28]. They provided two of the highest accuracy methods submitted to the ISPRS Vaihingen benchmark. Many other methods are categorized as model-based methods, as they rely on an explicit model or a set of predefined rules on the appearance of the buildings in the data. For instance, Bayer et al. [29] proposed a segmentation-based method involving multiple thresholds applied on the Digital Surface Model (DSM) and Normalized Difference Vegetation Index (NDVI) to separate buildings and trees. Similarly, Grigillo et al. [30] proposed two versions of a model-based method based on rule-set classifiers on image pixel colors and NDVI. However, the selection of such thresholds and rules is strongly scene-dependent.
Active contour model [31], or colloquially known as snake model, is an object boundary extraction technique widely used in computer vision and image processing [32,Ch. 5]. This technique has also been intensively studied to extract buildings from urban and residential areas. In contrast to other approaches mentioned above, it provides a building extraction solution without prior knowledge about the image and the building shapes. Moreover, this technique provides a computational simplicity, and an advantageous flexibility allowing external constraint forces introduced by the user. These characteristics show that snake model is suitable to be developed into a large-scale solution that fits our purposes.

Snake model-based related works
Guo and Yasuoka [33] used snake model with balloon force to extract buildings using high-resolution satellite images and height data. Peng et al. [34] focused on improving the stability of snake convergence on aerial images. Kabolizade et al. [35] proposed a snake model using imagery data coupled with a DSM generated from LiDAR data. This model involves the minimization of variances of height and gray-level between snake points. Consequently, it requires height information for every pixel of the image, in other words, the DSM must be of the same size and resolution as the optical image. Such requirement is problematic since LiDAR datasets usually have subsampled spatial resolution compared to the aerial imagery, yet a simple interpolation of height data could be unreliable. In contrast, Ahmadi et al. [36] proposed a geometrical snake model to detect building boundaries from aerial images, without height information or manual initial points. But this model requires a priori gray levels of buildings and ground, and uses them as training data to attract the snakes toward desired buildings. Consequently, it yields a high number of mis-detected buildings when they consist of untrained color. Additionally, it does not work well with the building roofs having varying gray levels. Fazan and Dal Poz [37] proposed a method involving exhaustive searches for rectilinear building corners in the optical images, based on the basic snake model optimized by dynamic programing. Yet this method depends heavily on initial points to have decent results. Snake models have also been demonstrated as an efficient tool to refine the public Geographic Information System (GIS) building footprints [38]. The improved footprints are then feed into Convolutional Neural Networks (CNNs) as labeled data for the building segmentation.
Our previous work [39] presented an unsupervised and automatic snake model to extract buildings from optical imagery. It is carried out based on a snake model operating on optical image, initialized and enhanced by integrating with LiDAR data. This snake model involves a novel external energy term computed based on the shape similarity between the snake and the projected LiDAR building boundary. Such an energy term encourages the snake to maintain a shape similar to the building boundary extracted from LiDAR data, while moving under the attractions of salient features provided by optical image. In contrast to the snake models mentioned above, this method succeeds at extracting buildings in various difficult cases, e.g building roof with similar color to its background, gable-roof houses or varying-color roof buildings. Without any human intervention or training data, it is able to achieve higher accuracy than existing snake models and many existing building extraction methods such as [25,40,41] on multiple test areas (see [39] for the full assessment). Nevertheless, similarly to other existing snake models, it still concedes a number of challenges, namely its sensitivity against image noise and undesired details, and the hyperparameter tuning for snake model in a large scale.
While there is not currently any effective solution regarding the former problem (i.e. snake sensitivity) when using optical imagery, the latter problem (i.e. hyperparameter tuning) has been partially addressed by Marcos et al. [42] with a deep learning-based approach. It involves using a CNN to learn the characteristics of the snake model elements, i.e. parameters and energy terms, from training optical images and associated ground truth polygons. The CNN-inferred parameters and energy terms enabled this snake model to achieve higher accuracy compared to other deep learning-based building extraction methods. However, the main drawback of this method is that it involves every image patch-each one containing a building-to have the same size, i.e. 512 × 512 pixels, for both the training dataset and the test dataset. It means that all the concerned buildings (training and testing) must have similar size in order for the CNN to learn and predict the parameters and energy terms. In other words, in order to resolve the snake parametrization problem, this approach proposed by Marcos et al. requires the building size consensus. This requirement affects directly the method reproducibility on buildings of different sizes. Consequently, such CNN-based snake parametrization approach is not scalable for large areas consisting of buildings of various sizes.

Contribution
The objective of this research work is to develop a large-scale automatic and accurate building extraction based on snake model, fulfilling the following requirements. Firstly, such an effective snake model would require an automatic and reliable initialization. Secondly, the snake model should not be sensitive to noise and details in the image. Thirdly, the snake model parameters should be relevant when applied to a large extended area with buildings of various shapes, sizes, and colors. While the first requirement is addressed by using the boundaries preliminarily extracted from LiDAR point cloud, the second and the third remain very challenging. In this regard, the contributions of this work are threefold: • We propose an effective solution to compute the external energy for the snake model-which is initialized by the LiDAR-based boundaries. Such solution enables the snake model to be insensitive to image noise and details, as well as easing the snake model parametrization. In addition, this snake model involves an improved balloon force that behaves adaptively by either shrinking or inflating the snake (as opposed to the classic balloon force that always inflates it). • In order to build a reliable foundation for this novel snake model, a super-resolution process is proposed to reliably improve the LiDAR point cloud sparsity. Such sparsity issue has been problematic to building extraction methods using LiDAR data, including snake models. • Lastly, we present a comprehensive performance assessment of the proposed SRSM on two different geographical contexts, namely Europe (with the Vaihingen benchmark dataset) and North America (with the Quebec City dataset). Such contexts involve various differences in terms of compactness, density and regularity of urban areas [43], demonstrating the scalability and versatility of the proposed method.
Together, these elements constitute a large-scale automatic and unsupervised building extraction method, which achieves high thematic and geometrical accuracy when tested on various urban scenes.

Paper organization
This paper is structured as follows: this Section has been devoted to an introduction to the building extraction research topic, our motivation and a literature review of the related works. The contributions of this research work have also been summarized. Section 2 presents the proposed method. Then, multiple assessments on the performance of the SRSM involving various study areas and datasets are carried out in Section 3. Next, Section 4 brings the discussions on the relevance of the proposed SR, then on the SRSM results, and lastly on the impact of the snake model parametrization. Finally, Section 5 provides conclusions and perspectives of this work.

Proposed Method
This paper presents a novel unsupervised building extraction method, built around the super-resolution-based snake model (SRSM). Figure 1 depicts the flowchart of the proposed method.  First, the SRSM is automatically initialized by the preliminary candidate building boundaries extracted from the LiDAR point cloud. This extraction process is carried out as presented in [39]. Since LiDAR-based building extraction can be difficult due to nearby vegetation, this process also involves a vegetation removal based on the Normalized Difference Vegetation Index (NDVI) derived from an optical image [44]. As the two data sources are used jointly, a registration is necessary in order to avoid misalignment problems. This registration can be carried out a priori (i.e. data acquisition using the same platform) or a posteriori [45,46]. The 3-D building boundary points extracted from the LiDAR point cloud are denoted by B i , where i represents the building index. The registration results in a set of camera pose parameters θ, which is then used for the projection of the 3-D building boundary points B i onto the image space, denoted by P θ (B i ). Then, they are used as initial points (denoted by b 0 i ) for the snake model, as well as to generate the building masks (denoted by M i ) used in the balloon force. The SRSM operates on high-resolution LiDAR-based z-images generated by a super-resolution process. It also involves an improved balloon force model based on the building masks M i . The resulting building boundary is denoted by b i .

Mathematical formulation
An active contour, or a snake is a dynamic curve x(s) = (x(s), y(s)), where s ∈ [0, 1] is the normalized arc length, defined within an image domain that is deformable under the influence of internal and external forces. The behaviors of the snake are governed by an energy function defined as follows, where E int and E ext , respectively, represent the internal and external energy terms. The internal energy term relates to the amount of stretch and curvature of the snake, respectively controlled by weighting parameters α and β. Small value of α and β, respectively, encourage short and smooth contours, and vice versa. The external energy E ext is composed of the forces due to the image itself E img and other constraint 6 of 30 forces E con . The external image-based energy E img involving salient features of the image i.e. lines, edges and terminations (i.e. line segment end-points, corners) is formulated as follows, where w line , w edge , w term are the weights of the respective salient features. Mathematical formulation of these energy terms [31] are provided in Appendix A. A snake that minimizes E snake described by Equation (1) must satisfy the following Euler equation, In order to solve Equation (3), the snake is made dynamic by regarding x as a function of time t as well as of the arc length s. Then, the partial derivative of x with respect to t is then set equal to the left hand side of Equation (3), as follows, As x(s, t) stabilizes, the partial derivative term ∂x/∂t vanishes and a solution for Equation (3) is obtained. A numerical approach for Equation (4) can be carried out by discretizing the equation and solving the discrete problem iteratively [31]. External constraint forces are added to the snake energy function in order to guide it toward or away from a particular feature, as well as addressing snake problems such as initialization, convergence, robustness against noise. In this regard, Xu and Prince [47] proposed Gradient Vector Flow (GVF) to improve the traditional snake model by allowing more flexible initialization and encouraging its convergence to boundary concavities, as well as improving its robustness. GVF field is defined as the vector field v(x, y) = (u(x, y), v(x, y)) that minimizes the energy functional with µ GVF is a controllable smoothing term, and f represents external forces from Equation (3), i.e. f (x, y) = −E ext . Using [48] the GVF field v can be found by solving where ∇ 2 is the Laplacian operator. The Euler equations (6) can also be solved by regarding u and v as functions of time, Once computed v(x, y) replaces the potential force −∇E ext in the dynamic Equation (3), yielding ∂x ∂t This equation is solved similarly as the traditional snake model, i.e. by discretization and iterative solution. The parametric curve solving the above dynamic equation is thus called a GVF snake.
Cohen et al. proposed an inflation term as an external force, known as balloon model [49], as follows, where κ is the magnitude of the force and n(s) stands for the normal unitary vector of the curve at x(s). This model mimics the inflation of a balloon by continuously pushing the snake points outward. Thus it prevents the snake from shrinking into a single point.

Proposed z-image-based energy term
Despite the recent developments, the existing snake models still struggle to yield a satisfactory reproducibility in complex environments. Such low reproducibility stems from a number of reasons. For instance, these environments can be composed of complex structures such as multi-planar roof buildings which can also be shadowed or occluded by trees. For a multi-planar roof building, different roof planes can have different shades, causing ridge lines (i.e. the intersection lines between the different planes) exhibiting high-gradient values in the image-based energy term. In addition, the performance of the snake model on optical images can be affected by image small details, namely roof objects (like chimneys, attic windows), cars, trees, etc. There also are possible null-valued pixels on the orthoimage. Consequently, if a building involves these unwanted elements, then the snake model would be drawn toward them. Hence, the resulting performance on delineating such building would decrease significantly. Fortunately, these problems relate directly to the use of the optical image. Therefore, we propose to operate the snake model on the z-image derived from LiDAR data. This approach allows the snake model to focus only on the most salient features in a z-image, i.e. height changes involving off-terrain objects such as building and trees.

Generation of z-image by the super-resolution of LiDAR data
The accuracy of a building extraction method using LiDAR data is usually compromised by the sparsity problem [21]. Therefore, we propose a process dedicated to the projection and propagation of LiDAR data onto the image space in order to augment its spatial resolution. Such process is called super-resolution (SR), and illustrated by the flowchart in Figure 2. It consists in generating a z-image that contains the altitude values derived from the LiDAR 3-D point cloud. Such image has the same size and resolution as the optical image. The inputs of the SR process are the LiDAR point cloud, a set of camera pose parameters, the frame of reference and the size of the optical image. The LiDAR 3-D point cloud is denoted by ψ ∈ R m×3 where m is the number of points. Each point has three spatial coordinates (x, y, z). We also use ψ z ∈ R m for the column of altitude values. The z-image is denoted by φ ∈ R n x ×n y , where n x and n y are, respectively, the number of rows and columns. During the SR process, φ is vectorized into a column vector of n = n x × n y elements. The set of camera pose parameters θ results from the registration. It is used to define the projection of 3-D points onto the image space.

(a) Projection of LiDAR 3-D points
The first step of the SR process consists in projecting the LiDAR 3-D points onto the z-image space using the camera pose parameters θ. As the LiDAR point cloud is subsampled compared to the optical image, such a projection leads to a sparsity effect on the z-image φ. Here, we use Ω * and Ω to denote, respectively, the subset of the pixel indices in the z-image φ, having or not a projected altitude value. In other words, φ Ω * denotes the sparse z-image or the sub-vector containing the pixels of projected altitude value, whereas φ Ω denotes the sub-vector containing the null pixels. The dimension of φ Ω * and φ Ω , respectively, are m × 1 and (n − m) × 1. As such, φ = φ Ω∪Ω * is the vector containing all pixels, i.e. the whole z-image. The projection is mathematically presented as follows, where P θ is the 3-D projection associated with the camera pose parameters θ. The xand y-coordinates of the LiDAR 3-D points are used to locate the pixels in the z-image associated with such points. Next, the projected values indexed by Ω * will be propagated to their neighboring pixels (which are indexed by Ω).

(b) Propagation of the projected values
Our SR approach is inspired by the work of Castonera et al. [50] on the fusion of terrestrial LiDAR data with optical imagery. It involves reconstructing a sparse depth map by minimizing the sum of its squared directional gradients (SSDGs). This approach relies on hypothetical characteristics of a depth map, which involve the magnitude and occurrence of depth discontinuities inside the depth map to be minimized. In an airborne nadir view context, their method shows good performance in propagating elevation values across homogeneous regions. However, in elevation-discontinued transitioning regions, e.g. near the edges of a building, the propagated elevation values would be gradually flattened as a result of the minimized SSDGs. In other words, such hypothetical characteristics are not suitable in this context, where the off-terrain objects like trees and buildings always exhibit strong elevation discontinuities. Such discontinuities should be preserved during the value propagation process. Thus, an l 1 -norm term is added in our minimization approach. This preservation allows the resulting z-image to exhibit elevation changes as tight as possible compared to the scene reality.
The propagation of the projected values is carried out through the minimization of a cost function F (φ), defined by Equation (11). It is composed of the SSDGs and an l 1 -norm term of the z-image φ, subjecting to the values previously projected from the point cloud (i.e. described by Equation (10)).
where · p stands for the l p -norm, ∇ x and ∇ y , respectively, represent the directional gradient operators along the x-axis and y-axis. The parameter λ > 0 controls the amount of the l 1 -regularization.

(c) Propagation implementation
The minimization of the cost function described in Equation (11) is carried out using the Fast Iterative Shrinkage-Thresholding algorithm (FISTA) [51]. Its computational efficiency is adequate for solving large-scale problems, with a convergence rate of O(1/k 2 ), where k is the iteration counter. FISTA is significantly faster than standard gradient-based methods such as Iterative Shrinkage-Thresholding algorithms (ISTA). Full details on the implementation the proposed SR process can be found in [46].
The convergence rate of the SR is illustrated in Figure 3. Figure 3a depicts the differences between the estimated z-images at consecutive iterations, i.e.
The cost values F (φ (k) ) through iterations are shown in Figure 3b. One can observe that the z-image has nearly converged into a stable solution after approximately four hundred iterations. Figure 4 shows the outcomes of the projection and the propagation of altitude values from the LiDAR data onto the optical image space. The value projection First iteration without null-valued pixels and (b) cost function value F (φ (k) ) displayed as a function of iterations, from the SR process of generating z-image φ. The vertical red-dashed lines represent the first iteration where every pixels of the estimate z-image is filled.  outcome is depicted by the sparse z-image φ Ω * (Figure 4a), whereas the value propagation outcome is shown by the dense z-image φ (Figure 4b). The pixel color of the sparse and dense z-images represents the surface elevation in meters. Figure 4c shows the reference optical image on the same urban scene, in order to assess visually the quality of the super-resolved z-image. It can be noted that the elevation of buildings and other objects (e.g. trees, cars, etc.) are well presented on the dense z-image (Figure 4b), and correspond to the information in the optical image. The proposed SR process is shown to achieve the purpose of augmenting the spatial resolution of the LiDAR point cloud. An assessment of the SR performance will be presented in sub-section 3.3.  Figure 5b and 5c show, respectively, the z-image and the energy term computed from the z-image. Then, the optical image and the associated energy term are depicted, respectively, in Figure 5d and 5e. In Figure 5c and 5e, the grayscale reflects the value of the energy term E img . The dark pixels represent the low-energy pixels, whereas the bright pixels are the high-energy ones. By design, a snake is attracted to the dark pixels, and will iteratively move toward them. Comparing the two energy terms (Figure 5c and 5e), it can be noted that the sources of attraction for the snake models, i.e. the dark pixels in the energy term, provided by the z-image are more relevant than the ones from the optical image. Indeed, the dark pixels from the energy term computed from the z-image E img (φ) (Figure 5c) are found mainly at the edges of the building, with a few exceptions caused by the nearby trees. There exists also one particular aberration which is circled in red on Figure 5c. It is caused by an absence of LiDAR points in this small region, as highlighted in the red circle on Figure  5a. On the other hand, the dark pixels from the optical image-based energy term E img (I) (Figure 5e) stem from many undesirable artifacts, namely cars, attic windows and chimneys found in this building roof. They are highlighted by the red circles on Figure 5e. By comparing Figure 5c and 5e, it can be noted that some of these artifacts do not exist, or are much less visible on the z-image-due to their low elevation variations compared to the building-to-ground ones. Also, the effect of shadow casting over a corner of the building-circled in green on Figure 5e-is also problematic to a building boundary extraction. Such effect and problem do not exist in the z-image. We can also remark that this multi-planar roof building exhibit many ridge lines which also produce low values in the energy term. In reality, they are not false non-building details like trees or cars, but when focusing on the extraction of the building boundaries, they can be considered undesirable. These premises show that it is more relevant to carry out the snake model on the z-image than on the optical image.

Improved balloon force
As aforementioned, the classical balloon force is conceived to constantly push the snake outward based on its local curvature (Equation (9)). Such behavior becomes less relevant when addressing buildings with complex shape. Therefore, we propose to adapt the balloon force to push outward at some particular region and shrink inward at some others. Such adaptation is explained in the following. Using the 3-D building boundaries preliminarily extracted from LiDAR point cloud, a set of building masks M i can be created. Such masks are generated by projecting the 3-D building boundaries onto the image space, and then determining the enclosed region inside the projected boundaries. Then, the adapted behavior is carried out through a signed magnitude matrix computed using the mask M i , as defined in Equation (12). Figure 6. Illustration of the balloon force on a rectangle building. (a) Original balloon force continuously inflating; (b) Improved balloon force behaviors, adjusted based on the snake local curvature and its relation to the LiDAR-based building mask (blue rectangle). The red arrows represent the balloon force applied to the snake points, moving from the current iteration (solid line) to the next one (dashed line).
with (x, y) is the coordinates of a point on the snake, and κ is the force magnitude weight. As a result, the improved balloon force for a building i is given as in Equation (13), where n stands for the normal vector of the curve at (x, y). Figure 6 depicts a schematic representation illustrating the proposed adaptation. It demonstrates how the snake behaves differently (i.e. either inflating or shrinking) based on its relation with the given LiDAR-based building mask. With the LiDAR-based building masks M i represented by the blue rectangle, the balloon force behavior (represented by the red arrows) is improve to shrink or inflate at different snake points.

Experimental Results
In this Section, multiple performance evaluations are carried out. First, we introduce the building extraction accuracy metrics, as well as the study areas and the datasets used in this work. Then, the performance of the SR process is evaluated. Next, a visual assessment between the snake models is also carried out. Lastly, the proposed SRSM is evaluated on various urban and residential scenes.

Building extraction accuracy metrics
Multiple accuracy assessments, thematically and geometrically, are proposed to evaluate the performance of a building extraction method based on the ground truth boundaries.

Thematic accuracy metrics
Based on the evaluation methodology described by Rutzinger et al. [52], three metrics, namely Quality (Q), Completeness (Cp) and Correctness (Cr), are measured per-object and per-area. Particularly, the per-object evaluation involves either all objects regardless of their area, or only the objects with an area larger than 50 m 2 . The three metrics are computed based on the count of true positive (TP), false positive (FP), false negative (FN) elements between the extracted and the reference building boundaries from the ground truth. These elements (TP, FP, FN) are defined differently if the evaluation is carried out per-object or per-area.
For the per-object evaluation, an extracted building is counted as a TP if at least 50% of its area coincide with its ground truth. On the other hand, a FP is an extracted building without a corresponding building in the ground truth, or if the coincided area with the ground truth is less than 50%. Whereas a FN means the proposed approach fails to extract a building existing in the ground truth. The corresponding Cp, Cr, Q metrics are then computed using Equation (14).
For the per-area evaluation, such metrics are computed using the count of pixels on the image. The area-based Quality Q is equal to the Intersection over Union (IoU) metric, which measures the ratio between the intersection area over the union area of the extracted building boundary E and the corresponding ground-truth R (Equation (15)). It reflects the overall accuracy of the building extraction method according to the ground truth. The Completeness Cp measures the fraction of relevant identified building pixels over the total number of actual building pixels, whereas the Correctness Cr computes the fraction of relevant identified building pixels among all identified pixels.
where #(·) denotes the number of pixels inside the given region. All three metrics Cp, Cr and Q reach their best value at 100% and worst at 0%.

Geometrical accuracy metrics
The geometrical accuracy of the method can also be evaluated by measuring the root-mean-square error (RMSE) of distances from extracted building outlines to the reference outlines, without considering points with distance greater than 3 meters. Such threshold is defined by the assessment methodology [26]. A smaller distance indicates a better geometrical accuracy.

Vaihingen dataset
The proposed building extraction method is tested using the ISPRS benchmark dataset on Vaihingen, Germany [53]. The test aims to demonstrate its effectiveness on complex environments, and to compare it with other methods. The ISPRS Vaihingen benchmark dataset involves three test areas consisting of buildings with diversified characteristics. In these test areas, the ground truth boundaries consisting of roof outline polygons were generated based on manual stereo plotting, with an associated planimetric accuracy of approximately 10 cm [26]. The columns two and three of the Table 1 describe the involved LiDAR and optical imagery datasets on these areas. Concerning the LiDAR data, we only use the data from one strip for each area. The orthoimage was generated based on the DSM derived from the LiDAR data. As a result, the misalignment between them is relatively small (i.e. less than 30 cm).

Quebec City dataset
Beside the assessments on the Vaihingen dataset representing a European urban context, we additionally conduct a performance assessment in another geographic context, namely North America. In this regard, the method is carried out on the urban areas of Quebec City, QC, Canada. They cover a total area of 656 square kilometers. The whole area is divided into tiles of 1 km × 1 km, as shown in Figure 7, for the sake of processing time and memory constraint. The involved LiDAR and optical imagery datasets are described in the column four and five of Table 1. The ground truth boundaries of buildings in the whole test area are provided and updated monthly by the City of Quebec, named Empreintes des bâtiments 1 . The ground truth dataset used in this work was downloaded on March 4 th , 2019. The dataset covers all the Quebec City territory with more than two hundred thousand building footprints. This work is carried out using a deep neural network, of which the foundation is the ResNet-34 [54]. The training set consists of three million labeled Bing images. In this paper, we conduct an individual performance assessment using the mentioned ground truth building boundaries in Quebec City, in order to compare with the SRSM results.
It should be noted that this assessment does not only allow evaluating the performance of the proposed SRSM on such a large dataset, but it also serves as an example demonstrating the scale of the study-i.e. the Quebec province, in which other cities and large areas should not cause any adaptability problem on such an unsupervised method.

Performance evaluation of the super-resolution
Beside the visual assessment provided in 2.2.1, the performance of the proposed SR process is also quantitatively evaluated. We compare it with other conventional 2-D interpolation methods, namely nearest neighbor (NN), bilinear and natural interpolation [55]. This evaluation and comparison are depicted by Figure 8. The four methods are examined on a real LiDAR point cloud with an average density of 3.8 points/m 2 (Figure 8a). Such point cloud is then subsampled by a chosen factor, namely 2, 4 and 8, yielding a subsampled point cloud which serves as an input for these SR/interpolation methods. These experimented factors are chosen based on the proportion between the respective spatial resolution of the datasets (cf. Table 1). For example, Figure 8b depicts the 3-D point cloud subsampled by a factor of 2. Based on the sparse DSM generated from this subsampled point cloud (Figure 8c), each interpolation method generates a DSM having the spatial resolution equal to that of the subsampled LiDAR point cloud times the upscaling factor-in other words, equivalent to the spatial resolution of the original point cloud. The resulting interpolated DSM provided by each method (e.g. Figure 8d or 8e) is compared with the DSM generated from the full-resolution LiDAR point cloud, which is considered as the ground truth for the assessment (Figure 8f). In order to evaluate the quality of these interpolation and SR methods-i.e. the closeness between the interpolated image and the ground truth image-we measure the following metrics: root-mean-square error (RMSE), structural similarity (SSIM) [56], and the peak signal-to-noise ratio (PSNR). SSIM and PSNR are two widely used objective metrics for evaluating image super-resolution quality [57]. Their mathematical explanations can be found in Appendix B. Table 2 summarizes the quality measurements of each interpolation method for all three upscaling factors, i.e. ×2, ×4 and ×8. Overall, compared to the other methods, the proposed SR process yields better results, i.e. smaller RMSE, higher SSIM and PSNR. However, it yields a disadvantageous SSIM compared to the natural interpolation and the bilinear interpolation, in the ×4 and ×8 upscaling. Considering the RMSE, one can remark that the improvement in the case of ×2 upscaling between the proposed SR and the others is only marginal (i.e. 1.96 compared to 2.00-2.18). In contrast, in the ×4 and ×8 upscaling, this margin of RMSE improvement becomes more significant. Similar remarks can be made when considering the PSNR. These improved quality measures show that the proposed SR is more reliable compared to the conventional interpolation methods. This quantitative assessment and the visual assessment (previously presented in 2.2.1) have demonstrated the relevance of the proposed SR method. It is deemed to fit the purpose to be used in the proposed SRSM.

Comparison between snake models
We also perform an assessment on the performance of the proposed SRSM, and compare it with other existing snake models previously mentioned in sub-section 1.3. They are carried out on the gable-roof building previously discussed in sub-section 2.2 and displayed in Figure 5. First, the ground truth building region is overlaid by a transparent green area, while the surrounding ground is displayed in transparent red color, as in Figure 9b. These overlaying colors allow to assess the building extraction more straightforwardly. Figure 9c presents the result of snake models on the exemplified building. Four snake models are compared, namely basic snake with GVF, snake model of Guo and Yasuoka [33], snake model of Kabolizade et al. [35], and the proposed SRSM. They all are unsupervised snake models, which are substantially different from each other. In addition, they are not constrained to one particular range of building sizes. These also are the reasons why the snake model-based methods [36,37,42] are not involved in this comparison. The snake parameters are set as follows, α = β = 0.2, balloon force magnitude κ = 0.1, and image-based energy term weights w line = 0.04, w edge = 2, w term = 0.01.
Here, all concerned snake models are initialized by the same LiDAR-based building boundary. These initial points are already an improvement compared to the ones proposed in the respective snake model. However, one can remark in Figure 9c that the other snake models (i.e. the basic snake, the snake model of Guo and Yasuoka, and the snake model of Kabolizade et al.) have problems approaching the true edges of the building. On the other hand, the proposed SRSM, under the influence from salient features of the z-image, converges very well towards the edges and the corners. As we previously stated in 2.2, this exemplified building involves many challenging regions. One of them is the building corner under shadow (circled in green in Figure 5e), which is now shown in the left sub-figure in Figure 9c. Another difficult region is found with many nearby cars (circled in red in Figure 5e). It is now zoomed in the bottom-right sub-figure in Figure 9c. It is shown that, on these two corner regions, all three of the previous snake models yield poor results. In contrast, the proposed SRSM approaches very well the ground truth boundary. This visual assessment shows that the z-image-based snake model yields much more accurate building boundary compared to other existing snake models.  Table 3 summarizes the quantitative results of the compared snake models, based on the area-based Quality and RMSE metrics. First of all, one can remark that, in general, the resulting Quality and RSME from all snake models are relatively high (more than 70%). This stems from the benefit of using the LiDAR-based building boundary as initial points. Then, the quantitative results also show the relevance of the proposed snake models, compared to other snake models. However, the margin of gain between them according to these values-i.e. a maximum Quality gain of 9.33% between the basic and the SRSM-is not as high as expected, given the clear advantage drawn from the visual assessment from Figure 9c.
One can also remark that the snake models were not able to extract two particular parts of the building (highlighted by yellow-dashed circles in Figure 9b) because they do not exhibit significant elevation change or color change from the surrounding ground (cf. Figure 5). On one hand, the inability of the SRSM stems from the absence of elevation changes. On the other hand, the other snake models are unable to extract these parts because of the absence of color changes. We are convinced that these undetected parts can be the reason of the low margin between the snake models mentioned above. Therefore, we also conduct another evaluation of all four snake models with a modified version of the ground truth building boundary, in which the two undetected parts are removed. Such modified ground truth boundary aims to provide the unbiased reference for the snake models. In Figure 9b, this modified ground truth boundary is depicted in blue outlines. The columns 4 and 5 of Table 3 reveals the involved comparison based on this modified ground truth. As expected, the new margin between the proposed snake model and the others is now much larger, i.e. a margin of 21.21% of Quality between the basic snake model and the proposed SRSM. It is coherent with the inference drawn from the visual assessment (Figure 9c). This comparison has shown that the proposed SRSM yields better accuracy than the other snake models. In the next two sub-sections, the overall performance of the SRSM on different datasets will be assessed.

Performance on ISPRS Vaihingen dataset
The three test areas of the ISPRS Vaihingen dataset are shown by Figure 10. Area 1 (Figure 10a) is situated in the center of the city and characterized by dense construction consisting of historic buildings with rather complex shapes. Area 2 (Figure 10b) is composed of high-rise residential buildings surrounded by trees. Lastly, area 3 ( Figure 10c) is residential with detached houses and many surrounding trees. The results of SRSM also are depicted in Figure 10a, 10b and 10c in green. Then Figure 10d, 10e and 10f illustrate the area-based accuracy assessment, denoting TP (in yellow), FP (in red), and FN (in blue) pixels. Overall, the proposed method yields a very high accuracy, reflecting by very high number of TPs on all three areas. However, a number of unresolved problems can be remarked in Figure 10. Firstly, many FP pixels can still be noted in all three areas. They relate to the problem of shadowed tree regions near buildings. Such tree regions are circled in green in Figures 10d, 10e and 10f. An example of this problem is from area 2, which is shown by Figure 11a. Secondly, several small buildings from all three areas have not been detected.    Table 4 shows the resulting geometrical accuracy of the SRSM. It can be noted that the RMSE on the area 3 is much lower than the other areas. It is due to the fact that area 3 are composed mostly rectangular buildings and less complex than the other areas.  Table 5 presents the resulting object-based accuracy of the SRSM. The columns two to four show the accuracy metrics on all objects, whereas the columns five to seven provide the metrics when considering only objects with an area larger than 50 m 2 . The differences between these two results reflect the aforementioned problem of undetected small buildings. In addition, several buildings have only been partially extracted, due to the fact that the non-extracted parts have very similar elevation to their surrounding area. Particularly, one of these building parts (circled in magenta in Figure 10e), in reality, is the roof of a basement covered by vegetation in the area 2, as shown in Figure 11b. This part has not been extracted by any existing building extraction methods submitted to this benchmark [26]. In the area 3, there is also one building (circled in cyan in Figure 10f) that is very poorly extracted. This problem is caused by the incompleteness of LiDAR data on this building, as shown in Figure 11c.
The resulting accuracy of the SRSM is then compared with other works submitted to the ISPRS Vaihingen benchmark portal 3 . As of January 26th 2020, there are currently 42 submitted methods. Four histograms are shown by Figure 12, summarizing the resulting accuracy of these methods averaged on the three areas. Each histogram shows the distribution of methods according to the area-based Quality (Figure 12a), object-based Quality (Figure 12b), object-based Quality for objects larger than 50 m 2 ( Figure  12c), and RMSE (Figure 12d). All four histogram are presented with their bins sorted in an increasing quality order, i.e. any particular bin involves a higher quality (i.e. higher Quality percentage and lower RMSE) than the bins on its left. Through these histograms, one can remark the developed consensus on results of the state-of-the-art methods. For instance, from Figure 12a, it is shown that the majority of the methods (i.e. approximately 62% or 26/42 methods) yield an area-based Quality ranging from 82.5% to 89.8%. On the other hand, 64% of the methods yield an object-based Quality ranging from 73.8% to 87.2% (Figure 12b). For object-based Quality for buildings larger than 50 m 2 (Figure 12c), a result greater than 97.6% is desirable, considering that 62% of the methods are capable of yielding such outcome. As a fully unsupervised and automatic building extraction method, our method yields very high accuracy. Indeed, considering the resulting average area-based and object-based Quality, respectively 86.57% and 81.60%, our method is placed among the top 20% of all benchmark methods, i.e. the 10 th or 9 th among 42 methods. These results are highly desirable compared to other existing methods. It is also worth-noting that many among the top-accuracy methods are supervised or model-based methods [26]. Indeed, the supervised methods proposed by Niemeyer et al. [27] and Chai [28], result in area-based Quality of, respectively, 87.8% and 89.7%. The model-based methods proposed by Bayer et al. [29] yields an area-based Quality of 89.8%, and the two versions of a method by Grigillo et al. [30] yield an area-based Quality of 89.4% and 89.7%. In addition, considering the object-based Quality for buildings with an area larger than 50 m 2 , our method is placed 12 th among 42 methods. However, considering the RMSE (Figure  12d), our method yields a result (averaging 1.09 m) among the highest RMSE, in other words, the least desirable. Future works will concentrate on improving such accuracy.
The proposed SRSM also faces several problems when performed on the ISPRS Vaihingen benchmark dataset, such as the problem of nearby shadowed vegetation shown in Figure 11a. Grigillo et al. [30] proposed to solve such problem with the rule-set classifiers on image pixel colors and NDVI. However, this approach involves multiple manually selected thresholds which requires a high level of supervision. There also exists other classification approaches (in order to better classify shadowed trees from buildings) involving graph-cut-based method [58]. However, such method may require a high amount of a priori information or user inputs in order to yield accurate results [32]. Therefore, by opting for such mentioned approaches, the level of supervision of the building extraction method should be reconsidered.

Performance on Quebec City
In order to test the performance and applicability of the proposed SRSM on a large scale, we carry it out on the Quebec City dataset. Many areas on Quebec City are composed of different types of urban, residential and industrial scenes. Two of these typical scenes are shown in Figure 13. They are also representative of the North American context. Based on a visual assessment, the SRSM succeeds at delineating accurately the building boundaries on the two exemplified scenes. Typically, the size of the buildings shown in both scenes varies greatly from small to very large buildings. One can remark that many buildings have similar color as their background (i.e. parking lots, open areas, etc.) are also well delineated. Other optical image-related problems such as roof objects, nearby cars are also avoided. This reemphasizes the benefits of using the z-images encoding LiDAR elevation data instead of the optical images. In addition, similar to the Vaihingen datasets (particularly area 1 and 2), the shape of buildings presented in these two examples-also verified across the whole Quebec City area-can be very complex. These three factors related to the scene complexity-i.e. varying building size, color, and shape-can be problematic to other methods, whereas the proposed SRSM is able to overcome such complexity. Table 6 summarizes the area-based and object-based accuracy yielded by the Microsoft open Canada building footprints, and the proposed SRSM. It can be noted that the Completeness and Correctness yielded by the two methods are quite different. These differences mainly stem from the fact that the two methods were carried out using different data sources with different characteristics. However, based on the resulting Quality values reflecting the overall accuracy, it can be noted that the SRSM provides a competitive outcome compared to the Microsoft method. Indeed, the Quality margins between the SRSM and the Microsoft method are well balanced. The SRSM yields a 6.65% higher object-based Quality, while in contrast, the Microsoft method provides a 7.40% higher area-based Quality. On the one hand, the difference of the area-based Quality stems from the fact that the resulting footprints from SRSM have the tendency to be slightly "rounded" around the building corners. Whereas the Microsoft footprints were generated (with their own polygonization method) without such problem. On the other hand, the SRSM with the advantage of the z-images encoding elevation data allows to detect the buildings more precisely, hence yielding the higher object-based Quality. Nevertheless, it is always worth-noting that such competitive accuracy is produced by an unsupervised approach, compared to the heavily supervised approach from Microsoft which was trained on three million labeled images. The complete dataset of extracted building boundaries in Quebec City by the SRSM, as well as the high-resolution version of Figure 13, are made publicly available at https://github.com/nthuy190991/SRSM_QuebecCity_building_extraction. (c) (d) Figure 13. SRSM results (in red outlines) on typical urban and residential areas in Quebec City, and the corresponding ground truth. Each example covers a 1 km ×1 km area. The outcomes of the SRSM on the Quebec City dataset are relevant, visually and quantitatively. However, there still remain two issues. Firstly, from a practical perspective, the SRSM was carried out separately on tiles (Figure 7) for the sake of processing time and memory constraint. Then, the tile-based results were combined in QGIS. Such step is crucial for the buildings located in the transitioning areas between two neighboring tiles. Several of those buildings can be identified near the borders of the tiles shown in Figure 13. Secondly, the SRSM is unable to separate connected or nearby buildings with similar height. Given the z-images involves only elevation information, such a separation task can be difficult. Therefore, we shall investigate the usefulness of other information for such task. Overall, these two issues can affect unfavorably the resulting accuracy of the SRSM. Future efforts will concentrate on addressing these two issues to improve the SRSM results.

Discussions
In this Section, three discussions are drawn to the attention: (i) on the relevance of the proposed SR, (ii) on the SRSM results, and (iii) on the impact of the snake model parametrization.

Relevance of the super-resolution
As suggested by the name of the proposed method (i.e. SRSM), the SR process plays a critical role. However, such process is not only relevant for snake models. Indeed, the need and potential of such process to enhance the spatial resolution of LiDAR data is high. For instance, in the topic of building extraction, several methods [16,38] proposed to replace the blue channel of RGB images with a normalized DSM (nDSM). Such composite image-i.e. red, green, and nDSM-are then feed into deep neural networks for extracting buildings. However, these approaches did not account for the fact that the two input images-the RGB image and the nDSM-usually have different resolution, hence a SR was not proposed. On the other hand, the SR process could resolve one of the problems of the snake model proposed by Kabolizade et al. [35] (cf. sub-section 1.3). A super-resolved DSM could improve the height variance-based external energy term proposed in their work. However, it is worth-noting that the main drawback of their snake model is still the use of optical image as the target image, i.e. for computing the E img . In other topics, the study of SR applied to LiDAR depth measurements is also very active. Indeed, a reliable SR would benefit many applications, such as calibration for autonomous driving [50], or land cover classification [59].

Discussion on the SRSM resulting footprints
The accuracy level of the SRSM results carried out on the Vaihingen dataset and the Quebec City dataset have been shown-through multiple assessments and comparisons-to be desirable. It has achieved our objectives for a large-scale high-accuracy building extraction method, without any assumptions on the building characteristics, nor any training data. However, two important aspects concerning the building footprints provided by the proposed method should be discussed. First, it can be noted from the results in Vaihingen ( Figure 10) and Quebec City (Figure 13) that, the resulting snakes have the tendency to be slightly "rounded" around building corners. Such problem can be addressed with an efficient polygonization method. However, such step can be quite challenging considering the complexity of building shape on the two study areas.
The second worth-mentioning aspect involves the acquisition time difference between the LiDAR data, the optical image data, and the reference ground truth boundaries. On the one hand, considering a benchmark dataset like the ISPRS Vaihingen dataset, such aspect is minimal since the data were acquired almost concurrently (cf. Table 1). In addition, the Vaihingen ground truth building boundaries were prepared using the same data. On the other hand, considering the large scale of Quebec City, such temporal aspect is much more complicated. Firstly, the LiDAR data were acquired one year after the optical images. Secondly, the Empreintes des bâtiments dataset consisting of the ground truth building boundaries was produced using multiple different sources, and updated monthly. Thirdly, the comparative Microsoft results were carried out using Bing Imagery data. Since Bing Imagery is a composite of multiple sources, we are unable to determine the exact dates for individual pieces of data 4 . Such temporal difference and uncertainty can affect the building extraction accuracy. This issue requires a dedicated study in order to account for all of the involved factors.

Impacts of snake parametrization
A snake model involves a number of parameters, such as α, β, κ (the balloon force magnitude), µ GVF (the GVF smoothing parameter), etc. In the existing models [34][35][36], these parameters have been set empirically in order to extract buildings effectively. The snake parametrization becomes extremely difficult over a large extended area. However, some parameters are more important than the others. In this regard, Marcos et al. [42] partially addressed such problem with a CNN-based approach. It involves learning the characteristics of the most important elements of the snake model, namely the snake internal energy term weights (α and β), the image-based energy term (E img ), and the balloon force (F balloon ). Additionally, they asserted that one scalar value of β for all parts of a building can lead to problems of over-smoothing at building corners, and under-smoothing at other regions. To avoid such problem, they proposed a local penalization approach, by assigning a different β penalization to each pixel depending on whether the pixels are near the building edges or corners, whereas α remains scalar for every pixel. In this discussion, let us analyze the relevance of such parametrization approach, and compare with our fixed parametrization for the SRSM. The characteristics of the CNN-inferred energy terms and parameters differ with respect to the features from the optical image (e.g. building corners, edges, etc.), as summarized by Table 7. Table 7. Characteristics of the CNN-inferred balloon force term F balloon , image-based energy term E img and snake curvature weight β among the optical image features (resulted by [42]). E img can have either positive or negative values, whereas F balloon ≥ 0 and β ≥ 0. Firstly, concerning the balloon force, the second column of Table 7 shows the characteristics of the balloon force inferred by the CNN. If a snake is initialized inside a building boundary, the balloon force-being positive-will inflate it outward until it reach the building corners and edges. Then, the balloon force sharply drops to zero and remains zero right outside the building boundary, which means that the snake is not allowed to inflate anymore. However, if the snake is provided with initial points outside the building boundary, the balloon force-being null-valued-is unable to shrink inward to approach the building true boundaries. Such behavior is not optimal. In contrast, the approach to generate F balloon proposed in this paper based on the LiDAR-based building mask is more relevant. It allows the snake to be shrank or inflated adaptively, regardless where it is initialized without relying on any learning process.

CNN-inferred energy terms and parameter
Secondly, we address the image-based energy term E img . The characteristics of the CNN-inferred image-based energy term E img are revealed in Table 7. However, they are similar to those exhibited by the traditional snake model mathematical approach (cf. Equation (2)). Such similarity is illustrated by Figure  14. The energy term E img of a rectangular building (Figure 14a) is mathematically computed and shown in Figure 14b. As illustrated, the building edges and corners exhibit negative to very negative values, whereas the pixels inside and outside of the building exhibit positive values. Such characteristics among building features are analogous to the CNN-based approach. Therefore, in the proposed SRSM, the energy term E img is retained as in the traditional snake model. The sole change is that the target image is the z-image instead of the optical image. Such change is relevant as the z-image-based energy term provides more desirable features-i.e. height changes instead of color changes.
(a) (b) Figure 14. Image-based energy term E img of a rectangular building with a color-consistent roof.
Lastly, concerning the snake curvature weight β, we retain the use of a fixed scalar β in our method. The immediate reason is that without a training phase, the generation of a different β value for each pixel is difficult, or even virtually impossible. In addition, as we changed the target image of the snake model, the needed dynamics for β should also change. Since the only sources of attraction for the snake model are now the height changes from off-terrain objects, the snake curvature does not need to be different pixel to pixel. The snake should be able to correct itself from such sources of attraction. A comparison is conducted to confirm whether using the CNN-inferred pixel-wise β would bring a real benefit compared with a fixed scalar β. As such, the SRSM is experimented where the value of α and β are, either inferred from CNN as in [42] or set to fixed scalar values. Such comparison is carried out on seven buildings in the proximity of the area 1 (ISPRS benchmark dataset) selected by Marcos et al. [42]. One of these buildings is exemplified in Figure 15. The optical image and the initial points for SRSM in blue are revealed in Figure 15a, whereas the z-image is shown in Figure 15b. The SRSM carried out with the CNN-inferred α and β results in the building boundary in green (Figure 15a). The CNN-inferred value of α is 0.767, whereas the image of β values (each pixels with a different β value) is shown by Figure 15c. Then, the SRSM carried out using the scalar α and β-both set equal to 0.2-yields the red building boundary (Figure 15a). The two snakes in red and in green are shown to be similar. Quantitatively, the area-based Quality provided by the CNN-based approach on all seven buildings averages 73.62%, whereas the fixed scalar parametrization approach yields 72.03%. By visual and quantitative assessment, it is shown that the CNN-inferred approach as well as the pixel-wise β does not bring a practical benefit to our SRSM. In summary, the proposed SRSM succeeds in providing a relevant solution, regarding all three main aspects of the snake parameterization. Indeed, since almost every building exhibits a strong elevation variation with respect to its surrounding area, the characteristics of building appearances on their respective z-image should be all similar. As a result, the proposed SRSM can be generalized with the same set of influential parameters on buildings of various size and shape, as well as in complex environments.

Conclusions
In this paper, we proposed and evaluated an unsupervised and automatic building extraction method dedicated to a large-scale urban scene. This method is built around an efficient snake model, named SRSM. First, a preliminary extraction of building boundaries from the LiDAR point cloud is carried out. These boundaries are used as initial points for the SRSM, as well as in the improved balloon force. Second, in order to resolve the sparsity problem related to the LiDAR data spatial resolution compared to an optical imagery dataset [21], we propose a super-resolution process. Such process is devoted to the projection and propagation of LiDAR data onto the image space, enabling to augment its spatial resolution. Then, the snake model is carried out based on the resulting z-images. Such z-images encoding LiDAR elevation data are highly beneficial since the height changes provide more reliable cue for extracting buildings than the spectral and textural changes provided by the optical images. In addition to such benefit, the useful elevation data are now provided with high spatial resolution. Third, the balloon force is improved to behave more adaptively compared to the classical balloon force.
By using the z-image, a number of typical problems related to the optical image have also been addressed. Until now, all of the existing snake models have conceded the sensitivity problem against image noises and details, such as roof objects and nearby cars and trees. Such scene elements prompt undesired sources of attraction, causing the snake model unable to converge toward the true building edges. Operating on the z-image which only exhibits significant height changes, the SRSM is provided with relevant sources of attraction. In addition, such fundamental replacement-i.e. using the z-image instead of the optical image-also affects the parametrization of the snake model. Indeed, the need for a hyperparameter tuning, e.g. by a deep learning approach [42], becomes less substantial. Thus, the SRSM is parametrized with fixed scalar values. By the virtue of the proposed improvements, such static parametrization does not restrain the applicability and scalability of the z-image-based snake model over large extended area. A comprehensive comparison and discussion of this parametrization with the deep learning approach by [42] has also been carried out in this paper.
Concerning the performance assessment, the SRSM is tested in two different geographical contexts, namely Europe (with the Vaihingen benchmark dataset) and North America (with the Quebec City dataset). The two contexts involve various differences in terms of compactness, density and regularity of urban areas [43]. The proposed SRSM yields very high accuracy on the ISPRS Vaihingen benchmark dataset, namely 86.57% of area-based Quality and 81.60% of object-based Quality. These values show that the SRSM is highly desirable, especially as a fully unsupervised method, as opposed to many other high-accuracy methods. Concerning the Quebec City dataset with the total area of 656 km 2 , the SRSM succeeds at providing a relatively high accuracy, namely area-based Quality of 62.37% and object-based Quality of 63.21%. Such accuracy level on this dataset may seem less desirable than the one on the Vaihingen dataset mentioned above. However, it can be well expected on such a large-scale dataset, with various types of complex residential, urban and industrial scenes. Indeed, compared to the building footprints produced by Microsoft by a deep neural network approach, our unsupervised method succeeds at providing a competitive accuracy level. The two geographical contexts also show the very high capacity of the SRSM for extending over very large and complex areas. With the proposed SRSM, this study has achieved our objectives for a scalable, versatile and accurate building extraction solution, in the context of the flood risk assessment in the province of Quebec. Future works will focus on improving the resulting geometrical accuracy, as well as on several remaining problems such as shadowed vegetation, and mis-detection of small buildings. Acknowledgments: The authors would like to thank the Centre GéoStat (Université Laval), as well as the Communauté Métropolitaine de Québec (QC, Canada) for providing the Quebec City datasets used in this work. They also would like to thank the City of Quebec for providing the Empreintes des bâtiments dataset used as ground truth building footprints in this work. A special thank goes to Dr. Eric Janssens-Coron from the Centre de Recherche en Données et Intelligence Géospatiales (Université Laval) for his help on the management of the Quebec City datasets. They also would like to thank the Microsoft Bing Maps team for developing and releasing the open Canada building footprints. The Vaihingen dataset was provided by the German Society for Photogrammetry, Remote Sensing and Geoinformation (DGPF) [53]. Lastly, they would like to thank Dr. Dirk-Jan Kroon from University of Twente for his contributions on the development of the conventional snake models.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A External Image-based Energy Term of Snake Model
The line functional is defined based on the intensity of the image I(x, y), with a filter for smoothing or noise reduction, such as Gaussian filter: The edge functional is based on the image gradient, which attracts for the snake to move towards edges with high gradient value.
where G σ (x, y) is a two-dimensional Gaussian function with a standard deviation σ and * denotes the 2-D convolution operator. Curvature of level lines in a slightly smoothed image can be used to detect corners and line segment terminations in an image. Using this method, let C(x, y) = G σ * I(x, y) be the smoothed image. With an angle θ = tan −1 (C y /C x ), the unit vectors which are along and perpendicular to the gradient direction are: n = (cos θ, sin θ), n ⊥ = (− sin θ, cos θ) (A3) The termination functional of energy is defined as:

Appendix B Super-solution quality metrics
Given the super-resolved image I and the reference image R, the Structural Similarity (SSIM) quality assessment index is based on the computation of three terms, namely the luminance term, the contrast term and the structural term.
where l(X, Y) = 2µ X µ Y + C 1 with µ X , µ Y , σ X , σ Y , and σ XY respectively are the local means, standard deviations, and cross-covariance for images X and Y. The parameters for SSIM index are set as follows, γ = δ = = 1; and C 1 = (0.01 × L) 2 , C 2 = (0.03 × L) 2 , C 3 = C 2 /2, where L = 2 #bits per pixels − 1 denotes the dynamic range value of the images. With these parameters, the SSIM index (A5) is simplified into, SSIM(I, R) = 2µ I µ R + C 1 Another metric for evaluating a method of super-resolution of image is Peak Signal-to-Noise Ratio (PSNR) in decibels, which is defined by Equation (A8).
PSNR(I, R) = 10 × log 10 peak_val 2 MSE(I, R) where peak_val is the maximum possible value of the images, and MSE is the mean square error between I and R.