Consistent Roof Geometry Encoding for 3D Building Model Retrieval Using Airborne LiDAR Point Clouds

A 3D building model retrieval method using airborne LiDAR point clouds as input queries is introduced. Based on the concept of data reuse, building models available on the Internet that have geometric shapes similar to a user-specified point cloud query are retrieved and reused for data extraction and building modeling. To retrieve models efficiently, point cloud queries and building models are consistently and compactly encoded by the proposed method. The encoding focuses on the geometry of building roofs, which are the most informative part of a building in airborne LiDAR acquisitions. Spatial histograms of geometric features that describe the shapes of building roofs are used as the shape descriptor, which provides shape distinguishability, encoding compactness, rotation invariance, and noise insensitivity. These properties make the proposed approach feasible for efficient and accurate model retrieval. Analyses of LiDAR data and building model databases, together with the implementation of a web-based retrieval system available at http://pcretrieval.dgl.xyz, demonstrate the feasibility of retrieving polygon models using point clouds.


Introduction
Recent developments in 3D scanning and modeling technologies have led to an increasing number of 3D models, most of which have been made available on web-based platforms with data-sharing services. In this context, the question "How to generate 3D building models?" may evolve into "How to find them in model databases and on the Internet?" [1]. This concept motivates this study to encode unorganized, noisy, and incomplete building point clouds acquired by airborne LiDAR and to efficiently retrieve 3D models from databases and the Internet. A set of complete or semi-complete building models is retrieved and reused instead of reconstructing point clouds using complicated modeling techniques [2][3][4][5].
The naive approach, text-based retrieval, uses keywords in metadata to search for the desired 3D models. This method is simple and efficient, but using keywords as queries suffers from difficulties caused by inappropriate annotations and language varieties. By contrast, content-based retrieval is a promising approach that encodes the geometries of queries and models using a shape descriptor and matches the encoded coefficients for data retrieval. Most previous studies on content-based 3D model retrieval take polygon models as input queries [6,7]. These methods can efficiently and accurately extract models from databases. However, obtaining a desired polygon model to use as an input query is difficult, which limits the usage of retrieval systems. With the aid of airborne LiDAR, which is capable of efficient spatial data acquisition, a model retrieval method that uses LiDAR point clouds as input queries is proposed. By consistently encoding point clouds and polygon models, a set of building models similar in geometric shape to an input point cloud can be retrieved for data extraction and efficient cyber city modeling. Based on the concepts of data reuse and crowdsourcing, the proposed system can efficiently construct a 3D city model, which is one of the key components of a virtual geographic environment.
Content-based model retrieval methods can be classified into two categories, model-based retrieval and view-based retrieval, based on the shape descriptor used in encoding. In model-based retrieval, shape similarities are measured using various geometric descriptors, including shape distribution [6], shape spectra [8], and shape topology [9]. In methods based on shape distribution, geometric features defined over feature spaces are accumulated in bins [6]. A histogram of these bins is used as the signature of a 3D model. In shape spectral methods, geometric shapes are transformed to a spectral domain, and the spectral coefficients are used in shape matching and retrieval [8]. In topology-based methods, model topologies are represented as skeletons, and retrieval is performed based on the assumption that similar shapes have similar skeletons [9]. Unlike model-based methods, view-based methods represent 3D geometric shapes as a set of 2D projections, and 3D models are matched using their visual similarities instead of geometric similarities [7,10,11]. Each projection is described by image descriptors, and shape matching is reduced to measuring similarities among the views of the query object and the models in the database. Model-based and view-based methods perform well on existing benchmarks for polygon model encoding and retrieval. However, these methods are not designed for unorganized, noisy, sparse, and incomplete point clouds. Recently, Chen et al. [12] proposed point cloud encoding using a set of low-frequency spherical harmonic functions (SHs). With the preprocessing steps of data resampling and filling, the approach can alleviate the difficulties caused by sparse and incomplete sampling of point clouds. However, the use of low-frequency SHs decreases the ability to distinguish objects with similar geometric shapes, thereby leading to ambiguity in shape description.
To improve shape distinguishability, a roof geometry encoding that integrates shape distribution with visual similarity is proposed. The main idea is to represent point clouds and polygon models using top-view depth images that describe the shapes of building roofs and avoid disturbances from the insufficient sampling of building side views. The depth images are further encoded by geometric features with spatial histograms, which introduce the properties of compact description, rotation invariance, noise insensitivity, and consistent encoding of point clouds and polygon models. These properties lead to a compact storage size and real-time retrieval response times. Furthermore, the visual similarity in depth images and the shape distribution in spatial histograms increase the distinguishability of geometric shapes. The remainder of this paper is organized as follows. Section 2 describes the methodology of point cloud encoding and building model retrieval. Section 3 introduces the properties of the proposed encoding method. Section 4 discusses experimental results, and Section 5 presents conclusions and future work.

System Overview
Figure 1 schematically illustrates the proposed system, which consists of three main steps: depth image generation, data encoding, and data retrieval. In the first step, the building models and the point cloud, that is, the input query, are represented by top-view depth images for the geometric description of building roofs. An interpolation process is then applied to the depth image of the point cloud for hole filling to facilitate consistent encoding. In data encoding, a set of geometric features is extracted from the interpolated depth images. The extracted geometric features are encoded using a spatial histogram with a determined origin, which provides a rotation-invariant shape description. During data retrieval, the encoded coefficients of the input query are matched with those of the models in a database using a shape similarity metric, and a set of best-matched models is extracted. In this section, the process of depth image generation is described in Section 2.2, followed by data encoding and retrieval, which are described in Sections 2.3-2.5.

Generation of Depth Image
Top-view projection is selected for depth image generation because of the characteristics of airborne LiDAR scanning: the roof is generally the most informative and identifiable part of a building in airborne LiDAR acquisitions. In a top-view projection, the pixel value is usually defined as the height difference between the ground and the corresponding point on the building roof. However, pixel intensity in this definition is dominated by building height, and the roof geometry becomes insignificant, especially for tall buildings. In this study, the pixel value of the depth image, denoted as D(x, y), is defined as the distance between the maximal height of the building and the height at that position, that is,

D(x, y) = Max_B − H_B(x, y),

where Max_B represents the maximal height of a building B, and H_B(x, y) denotes the building height at the position (x, y). In this definition, the pixel value of the depth image corresponds to the relative height difference, which enhances the roof geometry in the depth image. Depth image generation is a process of rasterization. The spatial resolution of the geometric description depends on the grid size used in rasterization. A large grid size reduces the chance of missing depth information and makes rasterization efficient, but results in a low-resolution depth image and a rough geometric representation of the point cloud. By contrast, a small grid size yields a high-resolution depth image with a fine geometric description, but results in time-consuming computation and a large number of empty pixels. In this study, a small grid size is preferred because data interpolation can be performed to alleviate the problem of missing information. In addition, the grid size is set according to the point cloud density with the expectation that, on average, each grid cell in the depth image contains one point. Therefore, the grid size is defined as GS = 1/√AvgPD, where AvgPD represents the average point density of a building point cloud. Figure 2 shows an example of point cloud rasterization. The average point density of the building is 16.67 pts/m², and the grid size GS is set to 0.245 m in rasterization. With this setting, the geometric detail of the building roof is preserved and only small holes are present in the depth image. The search for an optimal grid size is difficult because the determination of grid size is data-sensitive. Any advanced approach, such as the method proposed by Awrangjeb et al. [13], can be adopted and integrated into the system. Holes are generally present in depth images because of the rasterization of irregularly distributed LiDAR point clouds. To address this issue, a hole filling and completion process is performed using grayscale morphological operators. A grayscale morphological closing, which consists of dilation and erosion operators with a flat structuring element, is adopted to fill gaps in depth images. The proposed encoding method requires the preservation of point cloud topology and geometry, but it does not require a perfect depth image, that is, an image without holes. Therefore, a simple and efficient morphology-based approach is adopted. With this hole filling, only holes that are smaller than the structuring element are filled; thus, the geometric topology, such as a hollow shape, is maintained. In the implementation, a 7 × 7 structuring element is used.
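As a concrete illustration, the rasterization and hole-filling steps above can be sketched as follows. This is a minimal sketch, not the paper's implementation: the function name and the keep-the-highest-return rule per cell are illustrative choices, and SciPy's grayscale closing stands in for the flat-structuring-element morphology described in the text.

```python
import numpy as np
from scipy.ndimage import grey_closing

def make_depth_image(points, avg_density, struct_size=7):
    """Top-view depth image D(x, y) = Max_B - H_B(x, y) from a building
    point cloud (N x 3 array of x, y, z), with morphological hole filling."""
    gs = 1.0 / np.sqrt(avg_density)              # grid size from average point density
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    cols = np.floor((x - x.min()) / gs).astype(int)
    rows = np.floor((y - y.min()) / gs).astype(int)
    height = np.zeros((rows.max() + 1, cols.max() + 1))
    for r, c, h in zip(rows, cols, z):            # keep the highest return per cell
        if h > height[r, c]:
            height[r, c] = h
    # grayscale closing fills holes smaller than the flat structuring element
    height = grey_closing(height, size=(struct_size, struct_size))
    return z.max() - height                       # relative height difference
```

For a density of 16 pts/m² this yields a grid size of 0.25 m, matching the 0.245 m reported for 16.67 pts/m² in the text.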

Data Encoding
The first step of data encoding is determining a translation- and rotation-invariant origin of a depth image. Chen et al. [12] suggested selecting the center of the minimal bounding box of an object as the origin. However, the origin obtained by this approach is slightly sensitive to rotation, especially for non-symmetric buildings. For instance, in Figure 3, the cyan boxes are the minimal bounding boxes of rotated depth images. These bounding boxes differ in height and width, and thus the obtained origins, represented by cyan dots, are inconsistent. In this study, a depth image is regarded as a 2.5D model, and the barycenter of the 2.5D model is selected as the origin because this center is unchanged under similarity transformation. Examples of obtained origins are shown in Figure 3. The origins displayed in red remain constant as the object is rotated, which implies that this origin is invariant to rotation and translation.

Formally, given a depth image D(x, y), the shape origin (x_0, y_0) is defined as the weighted position in the depth image, that is,

(x_0, y_0) = Σ_{i=1}^{n} D_i (x_i, y_i) / Σ_{i=1}^{n} D_i,

where n represents the number of roof pixels, and D_i denotes the value of D(x_i, y_i), which is used as the weight of the vector (x_i, y_i). Before data encoding, the point cloud and the models are aligned to their origins to facilitate rotation- and translation-invariant encoding. Three geometric features, namely, a height feature, an edge feature, and a planarity feature, as illustrated in Figure 4, are used to describe the shape of a building roof in the depth image domain. These three features are described as follows. Height Feature. A building roof is represented as a depth image, and its pixel intensity denotes the relative height of the roof at that position. Therefore, using the pixel intensities of a depth image to represent roof geometry is intuitive and efficient. Following [14], the height feature, denoted as F_h(x, y), is defined as the pixel intensities of the depth image, that is, F_h(x, y) = D(x, y).
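Assuming roof pixels are identified by a mask, the weighted-origin computation above can be sketched in a few lines. The function name and the mask argument are illustrative; the depth value D_i is used as the per-pixel weight, per the definition in the text.

```python
import numpy as np

def shape_origin(depth, roof_mask):
    """Shape origin (x0, y0): the depth-weighted barycenter of the roof
    pixels, sum(D_i * (x_i, y_i)) / sum(D_i). A sketch; a roof mask and
    a nonzero total weight are assumed."""
    ys, xs = np.nonzero(roof_mask)
    w = depth[ys, xs].astype(float)          # D_i, the per-pixel weights
    return (xs * w).sum() / w.sum(), (ys * w).sum() / w.sum()
```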
Edge Feature. Sharp edges in an image correspond to high-frequency components that represent details in the image. Therefore, sharp edges are used as geometric features to represent roof geometries. Inspired by [15], a Laplacian of Gaussian (LoG) filter is adopted to extract sharp edges while alleviating disturbances from noise. The LoG is composed of Gaussian and Laplacian filters: the low-pass kernel of the Gaussian filter suppresses noise, and the second-derivative kernel of the Laplacian filter extracts sharp edges. The LoG is applied to depth images with inherent noise, and the resulting data are used as the edge feature, denoted by F_e(x, y). Planarity Feature. In contrast to sharp edges, which are high-frequency components, planes in an image relate to low-frequency components that represent coarse information in the image. Therefore, planarity is selected as a geometric feature. The planarity feature derived from the principal component analysis of a point set is a useful descriptor that describes the local geometry of a point set and indicates whether that geometry is planar. Given a 2.5D point set P = {(x_i, y_i, D_i)}_{i=1}^{n} from the pixels of a depth image, where the points are located within a circle of diameter r centered at a position p_c: (x_c, y_c, D_c), a simple method for computing the principal components of the point set P is to diagonalize the covariance matrix of P.
In matrix form, the covariance matrix of P is written as

C = (1/n) Σ_{i=1}^{n} (p_i − p̄)(p_i − p̄)^T,

where p_i = (x_i, y_i, D_i)^T and p̄ is the mean of the points in P. The eigenvectors and eigenvalues of the covariance matrix are computed by matrix diagonalization, that is, V^{−1} C V = D, where D is the diagonal matrix containing the eigenvalues {λ_1, λ_2, λ_3} of the covariance matrix C, and the matrix V contains the corresponding eigenvectors. Geometrically, the eigenvalues relate to an ellipsoid that represents the local geometric structure of the point set. Combinations of these eigenvalues provide discriminative geometric features. For instance, λ_1 ≈ λ_2 ≫ λ_3 indicates a flat ellipsoid that represents a planar structure, whereas λ_1 ≈ λ_2 ≈ λ_3 corresponds to a volumetric structure, such as the corners of buildings.
Following the definitions in [16,17], the planarity feature is defined as (λ_2 − λ_3)/λ_1, which enhances and describes planar structures. The planarity feature F_λ(x, y) obtained from the principal component analysis is used as a geometric feature.
The height, edge, and planarity features describe the height, high-frequency, and low-frequency components of a building roof, respectively. Together, these three features compose a complete geometric description of a depth image and a building roof. An example of the geometric features is shown in Figure 5. To visualize the features, the feature values are normalized to the range [0, 255] and displayed as gray images.

Spatial Histogram
Spatial histograms are used in encoding to reduce the data size of the geometric features and to achieve rotation-invariant encoding. In spatial histogram generation, the feature image is partitioned into several disjoint circular subspaces, called bins, which are all centered on the determined origin. As in an image histogram, the intensities of the pixels in the feature image are accumulated in bins, here according to their positions. The accumulated value in each bin is then normalized by the sum of pixel intensities, so each bin value lies in [0.0, 1.0] and the values of all bins sum to 1.0. Both the distribution of pixel intensities and the spatial positions of the pixels are encoded in the histogram, which gives the encoding its geometric distinguishability. In the implementation, the feature image is circularly partitioned into k bins of equal width, and the maximum radius of the depth image is set to the distance between the farthest pixel and the determined origin. The number of bins k is set to 30 to balance encoding compactness and shape distinguishability: a small k may lead to a rough and insufficient description of a geometric shape, whereas a large k may cause inefficient retrieval and increase sensitivity to noise.
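The radial binning just described can be sketched as follows, assuming the feature image is nonnegative and its nonzero pixels carry the feature mass (the function name and this masking convention are illustrative):

```python
import numpy as np

def spatial_histogram(feature, origin, k=30):
    """k-bin spatial (radial) histogram of a feature image about `origin`.
    Bins are concentric rings of equal width out to the farthest pixel;
    the result is normalized so the bin values sum to 1."""
    ys, xs = np.nonzero(feature)
    r = np.hypot(xs - origin[0], ys - origin[1])
    idx = np.minimum((r / r.max() * k).astype(int), k - 1)  # ring index per pixel
    hist = np.bincount(idx, weights=feature[ys, xs], minlength=k)
    return hist / hist.sum()
```

Because each pixel is accumulated only by its distance to the origin, rotating the image about that origin leaves the histogram unchanged, which is the source of the rotation invariance discussed above.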
With spatial histograms, the height, edge, and planarity features are described and encoded as (h_1, ..., h_k), (e_1, ..., e_k), and (p_1, ..., p_k), respectively. To further distinguish buildings with similar roof geometries but different sizes, the area of the roof, the height of the building, and the maximum radius are integrated into the encoding, which leads to a scale-variant shape representation. The maximal height of a building, denoted as M_h, is defined as

M_h = max(H_B(x, y)) − H_min,

where the function max() returns the maximum of its input, and H_min represents the minimum of the roof height H_B(x, y). The maximum radius, denoted as M_r, is defined as

M_r = max_radius(H_B(x, y)), for H_B(x, y) > H_min,

where the function max_radius() returns the maximal distance in the xy-plane between a roof pixel and the origin. The building outliers, that is, pixels with H_B(x, y) = H_min, are excluded from the distance calculation because outliers are generally protruding exterior facades lying near the roof boundaries. The roof area, denoted by A_f, is defined as

A_f = count(H_B(x, y)), for H_B(x, y) ≠ H_min,

where the function count() returns the total number of pixels of its input, excluding pixels with the minimal value H_min.

Data Retrieval
By combining the spatial histograms of the three geometric features, a point cloud or polygon model is encoded as

F_1 = (h_1, ..., h_k, e_1, ..., e_k, p_1, ..., p_k).    (6)

In addition, the maximal height, maximal radius, and roof area for scale-variant encoding are combined as

F_2 = (M_h, M_r, A_f).    (7)

Given the encodings in (6) and (7), the measurement of shape similarity is formulated as a combination of F_1 and F_2. Given a point cloud P and a building model M, the shape distance is defined as

dist(P, M) = |F_1(P) − F_1(M)| × (1 + 2 × |F_2(P) − F_2(M)| / |F_2(P) + F_2(M)|),    (8)

where the first part, |F_1(P) − F_1(M)|, measures the geometric similarity between the objects P and M, and the second part, 1 + 2 × |F_2(P) − F_2(M)| / |F_2(P) + F_2(M)|, which ranges from 1.0 (no penalty) to 2.0 (maximal penalty), is the penalty for differing object scales. No penalty is given for buildings of the same scale, and the maximal penalty is applied for buildings with an extremely large difference in scale. In (8), a small dist represents a high similarity between the point cloud P and the polygon model M, and vice versa.
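One plausible reading of the similarity computation, with the scale penalty applied as a multiplier on the histogram distance, can be sketched as follows. The argument names are illustrative: F1 is the 90-number histogram code and F2 = (M_h, M_r, A_f); this is a sketch of the comparison, not the authors' exact implementation.

```python
import numpy as np

def shape_distance(f1_p, f2_p, f1_m, f2_m):
    """Shape distance between a point cloud P and a model M: the L1
    distance of their histogram codes F1, scaled by a penalty that grows
    with the difference of their scale codes F2."""
    geom = np.abs(np.asarray(f1_p) - np.asarray(f1_m)).sum()
    f2_p, f2_m = np.asarray(f2_p, float), np.asarray(f2_m, float)
    penalty = 1.0 + 2.0 * np.abs(f2_p - f2_m).sum() / np.abs(f2_p + f2_m).sum()
    return geom * penalty
```

Two buildings with identical histograms and scales get a distance of zero; the same histogram difference at a very different scale is penalized, which is the intended scale-variant behavior.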
The building models downloaded from the Internet are encoded in preprocessing using (6) and (7), and the encoded coefficients are stored in a database. When a point cloud is selected as the input query in the online retrieval system (http://pcretrieval.dgl.xyz), the point cloud is also encoded using (6) and (7). The obtained coefficients are then matched with the coefficients in the database using (8). Shape similarities are sorted, and the top-ranking models are extracted as the query response.

Encoding Properties
The proposed encoding method possesses several properties that make it feasible for a retrieval system to extract polygon building models using point clouds as input queries. First, the point clouds and polygon models are consistently encoded with the aid of morphological hole filling and consistent origin determination. To demonstrate this property, a building model and its corresponding point cloud were tested. The encoding results in Figure 7 show that the encoded height, edge, and planarity coefficients of the point cloud and the polygon model are similar. This result indicates consistent encoding and the feasibility of retrieving polygon models using point clouds. Second, the proposed approach provides a shape similarity metric in which similar shapes have small distances and dissimilar ones have large distances. This property implies that the encoding can distinguish objects with different geometric shapes, which is the foundation of data retrieval. Third, the encoded coefficients are rotation invariant because of the spatial histogram and the rotation-invariant origin. Figure 8 shows that the coefficients remain unchanged when a 45-degree rotation is applied to the object. Fourth, the proposed encoding scheme is insensitive to noise because the noise problem is alleviated by the statistics-based histogram encoding. For example, in Figure 9, a point cloud with Gaussian noise is tested, where the standard deviation of the Gaussian noise is set to 0.1.
The results show that the encoded coefficients exhibit only slight differences when Gaussian noise is added to the data. Fifth, the building models in the database are compactly encoded: the encoded coefficient of a building model is a set of 90 real numbers, which is much smaller than the original model data. With the aforementioned properties, the proposed retrieval method can efficiently and accurately retrieve building models using point clouds.

Datasets
A database containing about one million 3D building models was tested. The models were downloaded from the Internet using web crawlers that systematically browse the Internet to search for 3D building models. The study area is the campus of National Cheng-Kung University, Taiwan, together with the Taipei 101 building. The airborne LiDAR point cloud of the study area was acquired by an Optech ALTM 30/70 in 2011 and was extracted from a combination of three overlapping scanning strips. The test LiDAR point cloud contains 2,772,880 points with an average density of 10.72 pts/m², or a point spacing of 0.31 m. The LiDAR point cloud of the study area and its corresponding aerial image are shown in Figure 10. For simplicity, trees near buildings were removed, and building point clouds were segmented and extracted from the test data manually. Any advanced point cloud segmentation or building extraction algorithm [18][19][20] can be adopted to extract building point clouds automatically or semi-automatically.

Computational Performance
The encoding time for a point cloud of 22,440 points covering an area of about 2000 m² is 150 ms, and the retrieval response time is 600 ms for a database containing one million models. The tested building point clouds, grid sizes, and computation times are shown in Table 1. The grid sizes are automatically calculated according to the average point density of each building. The encoding time mainly depends on the number of points and the xy-area of a building, and the retrieval time is linearly dependent on the size of the model database.

Evaluation of Model Encoding
To evaluate the proposed method, two sets of building models with different levels of detail (LODs) on their roofs were tested. The purpose of this experiment is to evaluate the distinguishability of encoding approaches on building models with different geometric roof details. The first dataset, which contains four models, is a building with hollow geometry. The original model with hollow geometry, denoted by LoD_a^1, is gradually reduced to a non-hollow model with simplified roof geometry, denoted as LoD_a^4. The shape similarities between the models and their corresponding point clouds are measured using (8). The models and their similarity values are shown in Figure 12. For the first dataset, the results indicate that the proposed encoding method can separate the model with hollow geometry from the models without hollow geometry. The results on the second dataset show that the shape similarities decrease near-linearly as the geometry of the original model is gradually simplified. These experiments demonstrate the ability to describe hollow geometric shapes and to distinguish models with different roof geometry details, which is superior to the related method [12] that encodes models using a set of low-frequency spherical harmonic functions. Because of its use of low-frequency spherical basis functions, the method of [12] cannot distinguish models with similar geometric details, nor models with and without hollow geometries.

Evaluation of Model Retrieval
The proposed method was compared with the method of [12] using a database containing around one million 3D building models downloaded from the Internet. Three building point clouds were tested as input queries. To evaluate retrieval accuracy, a commonly used measurement, the root mean square deviation (RMSD), is adopted to calculate the shape similarities between the input query and the retrieved models. Before the RMSD calculation, the retrieved models are aligned with the input query semi-automatically using the standard registration algorithm, iterative closest point [21]. The shape similarities obtained by RMSD are used as reference values in the retrieval evaluation. Figure 13 shows the retrieval results, including the retrieved models, their ranks, and the reference RMSD values. The visual comparisons show that the models extracted by the two methods are similar in shape to the queries, and that the ranking generated by the proposed method is better than that of [12], especially for the third dataset. This experiment shows that the proposed method, which combines shape distribution with visual similarity, improves the ability to distinguish geometric shapes compared with [12]. However, the rankings obtained by the two approaches do not match the RMSD references perfectly. For instance, in the first test dataset, the RMSD of the 5th-ranked model is larger than that of the 7th-ranked model, which means that the ranking is inconsistent with the reference. This imperfect ranking is caused by the use of a compact shape description: to achieve efficient retrieval, the data are compactly encoded, and only 90 coefficients are used in the proposed method to encode a point cloud or a polygon model. If a perfect ranking is required, the ranks of the extracted models can be further refined using the RMSD measurement. To further analyze retrieval performance, the ranking of all extracted models by RMSD is used as the reference. Then, the measurement for ranking is defined as the difference between the obtained ranking and the reference, that is, Diff_r = |R_method − R_ref|, where R_ref and R_method represent the reference ranking and the ranking produced by a method, respectively. The average of Diff_r, denoted as Avg.Diff_r, is used to estimate ranking quality. The statistical analysis of Avg.Diff_r is shown in Table 2; a small Avg.Diff_r means a high-quality ranking. In addition, the commonly used measurements of precision η_s and recall η_n were adopted to evaluate the retrieval accuracy using Data #3 in Figure 13. These measurements are defined as η_s = TP/(TP + FP) and η_n = TP/(TP + FN), where TP, FP, and FN represent true positives, false positives, and false negatives, respectively. The results are shown in Table 3. From the statistical analysis, the retrieval results of the proposed method are superior to those of [12].
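The ranking-difference and precision/recall measures above amount to a few lines of code; the inputs below are illustrative rank lists and model-id sets, not the paper's data.

```python
import numpy as np

def avg_diff_r(method_ranks, ref_ranks):
    """Avg.Diff_r: mean absolute difference |R_method - R_ref| between a
    method's rankings and the RMSD reference rankings."""
    return np.mean(np.abs(np.asarray(method_ranks) - np.asarray(ref_ranks)))

def precision_recall(retrieved, relevant):
    """Precision eta_s = TP/(TP+FP) and recall eta_n = TP/(TP+FN), where
    TP counts retrieved models that are in the relevant (ground-truth) set."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    return tp / len(retrieved), tp / len(relevant)
```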

Conclusions and Future Work
This study proposed a new method for 3D building model retrieval using LiDAR point clouds as input queries. To achieve consistent encoding of polygon models and point clouds, a rotation-invariant origin determination is adopted that uses the object volume to determine the origin instead of the minimal bounding box of the object. In addition, a morphological closing operator is applied to fill the holes in the depth images of point clouds. Filling holes effectively alleviates the difficulties caused by the sparse and incomplete sampling of point clouds and facilitates consistent encoding. The proposed encoding, which integrates shape distribution with visual similarity, increases the distinguishability of geometric shapes. The experiments on building models with different LODs show that ambiguous shape description is avoided. The experiments also demonstrate that using spatial histograms of geometric features as the shape descriptor introduces noise insensitivity and rotation invariance to the retrieval. These properties make the proposed approach feasible for efficient and accurate model extraction. Based on the qualitative and quantitative analyses of the LiDAR data and building models, and on the implemented web-based retrieval system, we conclude that retrieving building models using point clouds is feasible. In the future, we plan to extend the proposed method to point clouds from photogrammetry techniques and terrestrial LiDAR, which also have a great need for 3D modeling.

Figure 1 .
Figure 1. System overview. The system consists of three main steps: depth image generation, data encoding, and data retrieval.

Figure 2 .
Figure 2. Depth image. (Left, Middle): perspective and top views of the point cloud, respectively; (Right): top view of the depth image.

Figure 3 .
Figure 3. Comparison of the origins obtained using the minimal bounding box (cyan dots) and the building volume (red dots).

Figure 5 .
Figure 5. Results of the geometric features. (Left): original point cloud; (Right): results of the height, edge, and planarity features of the point cloud, respectively.
The total number of components in an encoded coefficient is 90. Examples of spatial histograms with 10 bins are shown in Figure 6. The height, edge, and planarity features, which describe different geometric properties, produce different spatial histograms.

Figure 6 .
Figure 6. Results of the spatial histograms of the height (left), edge (middle), and planarity (right) features. Feature images (top) and corresponding spatial histograms with 10 bins (bottom).

Figure 7 .
Figure 7. Consistent encoding of a point cloud (top) and a building model (bottom). 1st row: original data; 2nd-4th rows: results of the height, edge, and planarity features and their corresponding spatial histograms.

Figure 9 .
Figure 9. Demonstration of noise insensitivity. (Top): point cloud encoding results; (Bottom): encoding results of the point cloud with Gaussian noise added.

Figure 10 .
Figure 10. Study area. The aerial image (left) and the corresponding point cloud (right).

The simplified and detached parts of the buildings are marked in red. The second dataset contains five models, in which the original model with detailed roof geometry, denoted by LoD_b^1, is gradually simplified to a simple box, denoted by LoD_b^5. The aerial photos, LiDAR point clouds, and well-constructed building models, that is, LoD_a^1 and LoD_b^1, are shown in Figure 11. The point clouds and the models are encoded using the proposed method.

Figure 12 .
Figure 12. Tested models with different LODs. The numbers shown below the models represent the shape similarities between the models and their corresponding point clouds.

Figure 13 .
Figure 13. Comparison of the retrieval results obtained by our method (Method A) and the related method [12] (Method B).

Table 2 .
Table 2. Comparison of retrieved rankings using the proposed method and the method by Chen et al. (2014) [12]. 1st column: RMSD of the retrieved model; 2nd column: reference ranking; 3rd column: Diff_r value. The two numbers in each pair denote the performances of the compared methods.

Table 3 .
Table 3. Precision and recall of our method (Method A) and the related method [12] (Method B). The tested data is Data #3 in Figure 13.