Semantic-Based Building Extraction from LiDAR Point Clouds Using Contexts and Optimization in Complex Environment

The extraction of buildings has been an essential part of LiDAR point cloud processing in recent years. However, it is still challenging to extract buildings from huge amounts of points due to complicated and incomplete structures, occlusions, and local similarities between different categories in complex environments. Taking urban and campus scenes as examples, this paper presents a versatile, hierarchical semantic-based method for building extraction from LiDAR point clouds. The proposed method first performs a series of preprocessing operations, such as removing ground points and establishing super-points that serve as primitives for subsequent processing, and then semantically labels the raw LiDAR data. Since the purpose of this article is to extract buildings, the feature engineering step favors super-point features that describe buildings for the subsequent classification. Because a portion of the labeling results is inaccurate in incomplete or overly complex scenes, a Markov Random Field (MRF) optimization model is constructed to postprocess and refine the segmentation results. Finally, the buildings are extracted from the labeled points. Experimental verification was performed on three datasets in different scenes, and our results were compared with state-of-the-art methods. The evaluation results demonstrate the feasibility and effectiveness of the proposed method for extracting buildings from LiDAR point clouds in multiple environments.


Introduction
Building objects management is of great importance for many applications in various fields, including city planning, energy analysis, 3D reconstruction and visualization, etc. As a significant requirement of smart cities, building extraction from various remote sensing data plays an increasingly critical role in the aforementioned applications. In particular, automatic or semi-automatic building extraction algorithms from images have been extensively studied in the past. However, image distortions caused by camera lenses limit the accuracy, and these approaches are labor-intensive, time-consuming, and costly under poor conditions. Light detection and ranging (LiDAR) technology has developed rapidly in recent years. The proposed method refines the labeling results by MRFs, which do not require fully supervised training scenes. This improves the labeling results by reducing unnecessary categories used in describing a region. Finally, based on the labels, the proposed hierarchical method extracts buildings from MLS/TLS data in urban and campus environments. The remainder of this paper is organized as follows. Following the introduction, the key components of our proposed method are illustrated in Section 2. In Section 3, the experimental studies and analysis are elaborated. Section 4 discusses the results of the experiments. The conclusion is given at the end of this paper.

Materials and Methods
The proposed method is carried out according to a hierarchical process; the workflow is shown in Figure 1. The LiDAR point clouds are first separated into ground and off-ground points using an existing ground filtering algorithm [22] to eliminate the connectivity between different objects. Then, outlier and noise filtering is performed on the off-ground points. The further process consists of three main steps: 1. Non-ground points are over-segmented to generate super-points; 2. Local feature sets are selected and extracted; 3. Buildings are extracted based on point cloud classification using context information.
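The paper separates ground from off-ground points with a cloth-simulation filter [22]. As a rough, hedged stand-in (not the authors' algorithm), the sketch below illustrates the same ground/off-ground split with a simple grid-minimum height rule; the function name and parameters are hypothetical.

```python
# Hypothetical sketch of the ground / off-ground split that precedes steps 1-3.
# The paper uses a cloth-simulation filter [22]; as a simplified stand-in, a
# point is called "ground" when it lies within `h_tol` of the lowest point in
# its (cell_size x cell_size) horizontal grid cell.
from collections import defaultdict

def split_ground(points, cell_size=1.0, h_tol=0.3):
    """points: list of (x, y, z) tuples. Returns (ground, off_ground) lists."""
    lowest = defaultdict(lambda: float("inf"))
    for x, y, z in points:
        cell = (int(x // cell_size), int(y // cell_size))
        lowest[cell] = min(lowest[cell], z)
    ground, off_ground = [], []
    for x, y, z in points:
        cell = (int(x // cell_size), int(y // cell_size))
        (ground if z - lowest[cell] <= h_tol else off_ground).append((x, y, z))
    return ground, off_ground
```

For example, two low points and one 5 m-high point in the same cell would be split into two ground points and one off-ground point. A real CSF implementation additionally simulates the cloth's rigidity, which handles slopes far better than a per-cell minimum.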
Each step of our method is detailed as follows.

Super-Points Generation of Non-Ground Points
First, the raw LiDAR point cloud is inverted, and the inverted surface is then covered by a rigid cloth. The locations of the cloth nodes are determined by analyzing the interactions between them and the corresponding points, which generates an approximation of the ground surface. Finally, the ground points can be extracted from the LiDAR point cloud by comparing the original LiDAR points with the generated surface. After the ground points are separated from the scene, spatially relatively independent non-ground points are obtained, but the amount of data is still huge. Point-wise processing is highly challenging, for example because of its heavy computing cost. In order to improve the segmentation efficiency for large-scale scenes and reduce the burden of a large number of points, the proposed method divides the raw scene space into super-points, which are taken as basic units in further processing.
The super-points generation in the proposed method is different from other segmentation algorithms, in which the points within each super-point have consistent geometric characteristics and appearance. Its purpose is to divide the point cloud into smaller clusters, not to achieve a certain segmentation. The proposed method focuses on building extraction from LiDAR point clouds, so it is necessary to preserve object boundaries well. Several existing methods face a challenge because LiDAR point clouds have non-uniform density and often overlap. The VCCS (Voxel Cloud Connectivity Segmentation) algorithm [23] and its related methods [24][25][26][27] may not effectively preserve boundary information. In addition, some advanced algorithms [28] can preserve object boundaries and small structures more effectively, but they are likely to be sensitive to data quality.
To make super-points conform better to object boundaries and provide accurate geometric information for further processing, we replace the adjacency octree index in the VCCS algorithm with a K-nearest-neighbor search to expand the super-points [29]. Unlike VCCS, which selects seeds at a unified resolution, the proposed method adopts the k-NN search to establish adjacencies between super-points from the neighboring relationships. Moreover, to preserve more geometric features, the proposed method works directly on the original data instead of a voxelized point cloud. The super-points generated by the proposed method are adequately homogeneous and provide accurate local geometric information (as shown in Figure 2c). In this study, the features of a point in one super-point were calculated using all of the points in that super-point, meaning that the features of all points within a super-point were the same, and all points within one super-point were assigned the same class label [30].
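The k-NN expansion idea can be sketched as a region-growing loop: a seed absorbs its k nearest unassigned neighbors within a distance bound, and the process repeats until all points belong to a super-point. This is a minimal illustration, not the authors' exact algorithm (which also enforces geometric consistency); all names and thresholds are assumptions.

```python
# Hypothetical sketch of super-point generation by k-NN region growing.
# Each super-point grows from a seed by absorbing the k nearest unassigned
# points that lie within `max_dist`, keeping clusters spatially compact.
import math

def grow_superpoints(points, k=3, max_dist=1.0):
    """points: list of (x, y, z). Returns a super-point label per point."""
    def dist(a, b):
        return math.dist(points[a], points[b])

    unassigned = set(range(len(points)))
    labels = [-1] * len(points)
    sp = 0
    while unassigned:
        seed = min(unassigned)              # deterministic seed choice
        unassigned.discard(seed)
        labels[seed] = sp
        queue = [seed]
        while queue:
            cur = queue.pop()
            # k nearest unassigned candidates of the current point
            for j in sorted(unassigned, key=lambda j: dist(cur, j))[:k]:
                if dist(cur, j) <= max_dist:
                    labels[j] = sp
                    unassigned.discard(j)
                    queue.append(j)
        sp += 1
    return labels
```

On two well-separated clumps of points, the loop yields two super-points; a production version would replace the brute-force neighbor search with a k-d tree and add a normal/appearance-consistency test before absorbing a neighbor.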

Local Feature Sets Selection and Extraction
All points within one super-point are assigned the same label, so these points are characterized by analogous properties. Each super-point is treated as a basic operational unit, which means that different local features are extracted based on the derived super-point neighborhood after over-segmentation. The biggest benefits are robustness to noise and outliers, and reduced computational cost. As an essential process in building extraction, point cloud classification needs to fully consider the local feature types of point clouds that can distinguish building objects. It also needs to ensure the consistency of buildings and extract building objects completely. Feature selection and extraction serve as the basis for 3D semantic segmentation, and their performance plays a decisive role in classification and subsequent processing.
As the most important man-made objects in urban scenes, building structures have obvious geometric features. After generating the super-points, we carefully selected several types of local features in this study: height, orientation, planar, covariance and projection features. These features describe the differences between buildings and other objects in the scene in several ways. According to the geometric features of the clusters, we constructed a set of feature vectors for classification, as shown in Table 1.
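The covariance features of Table 1 follow directly from the sorted eigenvalues of a super-point's 3x3 covariance matrix. The sketch below mirrors the standard formulas (linearity, planarity, sphericity, anisotropy); the parenthesisation of the ratios follows the common convention, reconstructed from the garbled source text rather than taken verbatim.

```python
# Covariance-based features of Table 1, computed from the sorted eigenvalues
# l1 >= l2 >= l3 > 0 of a super-point's covariance matrix. The ratio forms
# L = (l1-l2)/l1, P = (l2-l3)/l1, S = l3/l1, A = (l1-l3)/l1 are the usual
# convention; treat them as a reconstruction of the paper's formulas.
def covariance_features(l1, l2, l3):
    assert l1 >= l2 >= l3 > 0
    return {
        "linearity":  (l1 - l2) / l1,   # L_lambda: 1D (pole-like) structure
        "planarity":  (l2 - l3) / l1,   # P_lambda: 2D (facade/roof) structure
        "sphericity": l3 / l1,          # S_lambda: 3D (vegetation) structure
        "anisotropy": (l1 - l3) / l1,   # A_lambda
    }
```

Note that with this convention linearity, planarity and sphericity sum to one, so the three values form a soft assignment of the super-point to 1D, 2D and 3D structure; planar-dominated super-points are the natural building candidates.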
The feature vector consists of: the normalized height D_z and the elevation difference standard deviation σ_h of the height features; the covariance features, including the eigenvalues λ_1, λ_2, λ_3 (λ_1 ≥ λ_2 ≥ λ_3 > 0), the linear, planar and volumetric geometric features L_λ = (λ_1 − λ_2)/λ_1, P_λ = (λ_2 − λ_3)/λ_1 and S_λ = λ_3/λ_1, and the anisotropic feature A_λ = (λ_1 − λ_3)/λ_1; the orientation feature, represented by the angle θ between the normal vector of each super-point and the normal vector of the horizontal plane; the planar geometric structure D; and the projection features PA_h and PA_v.
Different types of features have different saliences for different objects, and a combination of features separates multiple objects in an outdoor scene as much as possible. The heat map distribution of the features in the scene is shown in Figure 3. It can be found that the height features are more prominent on buildings and trees, the orientation feature on buildings, roads and power lines, the planar features on buildings and the ground, and the volumetric features on trees. Significantly, the projection features are more prominent on the ground points. This means that different features have a certain ability to distinguish particular objects, so integrating multiple types of features helps distinguish buildings from the scene.
Once a variety of local features of the point cloud has been extracted, it has to be considered that some of them may be redundant or irrelevant with respect to the semantic segmentation. Hence, it is often desirable to select a compact subset of relevant features that can achieve the best performance. The purpose of feature selection [31] is to remove features with weak classification ability; a significant increase of classification efficiency as well as accuracy can be expected because much less information is involved.
The feature selection in the proposed method mainly includes two steps: (1) obtain and rank an importance index of each feature for the categories by derived scores, where a lower-ranked feature is considered to have weak classification ability; (2) calculate the correlation coefficients between the features. If the correlation coefficient between two features is high, the lower-ranked feature is considered redundant and can be deleted. The feature combination obtained by setting a correlation coefficient threshold is the training feature vector set on which the final point cloud classification depends. To avoid a classifier-dependent solution for deriving feature subsets, we directly calculate relevance from the training data by a multivariate filter-based feature selection [32], which evaluates intrinsic properties of the given data. Since the value of a feature can be regarded as continuous in a certain interval, we evaluate the score function with respect to both feature-class and feature-feature relations. The correlation between two continuous variables is calculated by several measures [33], such as information gain (a measure revealing the dependence between a feature and a class label) [34] and the Pearson correlation coefficient (a measure indicating the degree to which a feature is correlated with a class label) [35]. Following the provided implementation, a higher value indicates more relevance.
Figure 4a,b show the importance ranking of the features and the effect of feature selection on classification accuracy, respectively. It can be found that F14 (horizontal projection feature PA_h), F3 (eigenvalue λ_1) and F15 (minimum vertical projection feature PA_v) have the least importance; for this reason, we assume the three worst-ranked features of the importance metric to be pointless in the experiments. Following the principle of forward selection, we begin with only the most important feature. Subsequently, the derived order of the features is used to successively train and test the classifiers with one additional feature per iteration. As shown in Figure 4b, the classification accuracy reaches its highest value of 0.903 after adding the λ_1 feature (F3), while the accuracy decreases after adding the PA_v feature, indicating that this feature harms the classification accuracy and can be deleted.
After deleting the minimum vertical projection feature by the importance judgment, the relevance metric between features is calculated according to the importance ranking, as shown in Figure 5a. The correlation threshold c_t is varied from 0.5 to 1, and the relationship between correlation and classification accuracy under different thresholds is examined, as shown in Figure 5b. In this paper, a feature whose pairwise correlation coefficient is greater than or equal to c_t is considered a candidate redundant feature that needs to be deleted. When the classification accuracy reached its highest value of 0.913, the corresponding feature correlation was 0.92, so c_t is set to 0.92. F11 (curvature feature C_λ) and F8 (eigenvalue-based spherical feature S_λ), as well as F1 (normalized height feature D_z) and F2 (height standard deviation feature σ_h), satisfy this condition. Furthermore, since the importance of C_λ is greater than that of S_λ, the spherical feature S_λ is deleted; similarly, the normalized height feature D_z is discarded since the importance of σ_h is greater than that of D_z.
Finally, after combining the two constraints of feature importance and correlation, the minimum vertical projection feature PA_v, the eigenvalue-based spherical feature S_λ and the normalized height feature D_z are deleted, and the optimal feature set [σ_h, λ_1, λ_2, λ_3, L_λ, P_λ, A_λ, O_λ, C_λ, θ, D, PA_h] is obtained. Moreover, after removing the redundant features, the classification accuracy improved from 0.903 to 0.913, which indicates that feature redundancy affects the classification result.
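The two-step selection above (importance ranking, then correlation-based redundancy pruning) can be sketched as follows. This is a simplified illustration under stated assumptions: the importance scores are taken as given rather than derived from a score function, and c_t defaults to the paper's 0.92.

```python
# Hypothetical sketch of the two-step feature selection: (1) rank features by
# an importance score; (2) drop the lower-ranked feature of any pair whose
# |Pearson correlation| reaches the threshold c_t (0.92 in the paper).
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_features(columns, importance, c_t=0.92):
    """columns: {name: list of values}; importance: {name: score}.
    Returns the kept feature names, most important first."""
    kept = []
    for name in sorted(columns, key=lambda f: -importance[f]):
        # keep a feature only if it is not near-duplicated by a kept one
        if all(abs(pearson(columns[name], columns[k])) < c_t for k in kept):
            kept.append(name)
    return kept
```

In the toy test below, feature "b" is an exact linear copy of "a" (correlation 1.0) and is pruned, while the weakly correlated "c" survives despite its low importance.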

Label Refinement by Higher Order MRF
The above features are scaled into the range [0, 1] before being applied to the classifier. To recognize candidate objects in a complex environment, a Random Forest (RF) classifier was used for point cloud classification. Specifically, the classifier was trained on manually labeled data, and the proposed method classifies the entire scene through the trained RF classifier. Unfortunately, using only local features is prone to label noise, which means that the classification results lack consistency. We therefore consider more context information to optimize the results. MRFs can describe the relationships and interactions among adjacent data and are used to perform spatial context construction.
We formalize the solution of the optimal classification label configuration of the point cloud as the maximum a posteriori probability estimation problem of MRFs. Inspired by work in computer vision, this problem can be naturally formulated as the minimization of an energy function designed as follows:

E(L) = E_data(L) + λ · E_smooth(L)

where E_data(L) is the first-order data term, which measures the disagreement between the labels and the raw data, while the second-order smooth term E_smooth(L) mainly describes the inconsistency of labels in local neighborhoods based on local context information; λ is the weight coefficient between the first-order potential and the second-order potential. In this paper, the point cloud classification results are obtained by minimizing this energy function.
Local neighborhood construction is the most important part of point cloud classification optimization based on the MRF model, as it creates context relationships among local point clusters. In the existing MRF model, the local neighbor system is created using the K-nearest neighbors, and the K clusters with the closest spatial distance are grouped into a neighboring system. However, since only the spatial distance is considered, this method tends to propagate optimization errors in overlapping occlusion regions (for example, at the intersection of buildings and trees, it is easy for partially overlapping buildings to be optimized into the tree class).
In order to solve this error propagation problem, the similarity relationships among clusters are calculated, and clusters with high similarity are selected to construct a locally optimal neighborhood system, as shown in Figure 6 (the red line connections constitute the optimal neighborhood system, indicating point clusters with higher similarity; the dotted lines connect the dissimilar point clusters that need to be deleted from the neighborhood during the construction process). The proposed method is based on the obtained optimal local feature set, and then selects from the K-nearest neighbors the clusters whose correlation satisfies the threshold p < 0.70 to construct an optimal neighborhood system.
The probability distribution problem is transformed into the energy function problem, and the optimal solution of the point cloud classification is obtained by minimizing the energy function. Minimizing the energy function is an NP-hard problem, and most state-of-the-art methods (e.g., Iterated Conditional Modes and Simulated Annealing) achieve quite good results in terms of solution quality. However, for large-scale point clouds, using larger values of K still brings a huge computational burden, and these classical algorithms require many iterations of small changes, so their efficiency is low. In this paper, the graph cut algorithm [36] is used to minimize the energy function. This method can make larger changes to the labels in each iteration and reduces the number of iterations, achieving efficient energy optimization.
The calculation of the energy function mainly includes the first-order and second-order terms. The first-order energy function mainly measures the inconsistency between the prediction and the ground truth under a given feature set F. In this paper, the Random Forest (RF) algorithm is used to represent the data term according to the posterior probability estimated from the local optimal features, i.e.,

P(c_i | F) = N_{c_i} / N_T

where N_{c_i} is the number of votes for class c_i, and N_T is the number of weak classifiers of the RF; N_T = 200 is selected through cross-validation in this paper. The weight of each adjacent edge is calculated according to the adjacency relationship, and the second-order energy function is then calculated with

w_{ij} = exp(−d(i, j)² / (2σ²))

where d(i, j) is the Euclidean distance between the cluster centroids, and σ represents the average spatial distance. In order to choose the optimal weight λ, which balances the data term and the smooth term, we analyzed the impact of λ on the labeling performance on Dataset A. The weights λ were set to 0.5, 0.75, 1.0, 1.25, 1.5, 1.75 and 2, respectively. As shown in Figure 7, the initial labeling results of buildings improve with changes of the parameter λ. When the smoothing factor reaches 1.25, the F1-measure of the building class tends to be stable, and the classification accuracy peaks at λ = 1.5. A larger weight imposes more cost on the number of used categories but may lead to over-smoothed labeling results, whereas a smaller λ means less penalty on the number of categories used in a region, which results in a relatively large number of incorrect labels that cannot be effectively corrected. Setting the smoothing term coefficient to 1.5 achieves a balance and the highest building classification accuracy, thereby obtaining promising fine labeling results.
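The energy above can be illustrated on a tiny labeled graph. The paper minimizes it with graph cuts and α-expansion [36,37]; purely to keep the example short, the sketch below uses Iterated Conditional Modes (ICM) instead, updating one node at a time, and assumes the RF posteriors and edge weights are already given. All names are hypothetical.

```python
# Hypothetical sketch of minimizing E(L) = E_data(L) + lambda * E_smooth(L)
# by ICM (the paper uses graph cuts / alpha-expansion; ICM is simpler to show).
# E_data(i, c) = -log P(c | F_i) from the RF votes; E_smooth penalizes label
# disagreement across an edge (i, j) with weight w_ij.
import math

def icm(probs, edges, lam=1.5, iters=10):
    """probs[i][c]: RF posterior for node i, class c; edges: {(i, j): w_ij}.
    Returns one refined label per node."""
    n, n_classes = len(probs), len(probs[0])
    labels = [max(range(n_classes), key=lambda c: probs[i][c]) for i in range(n)]
    for _ in range(iters):
        changed = False
        for i in range(n):
            def energy(c):
                data = -math.log(max(probs[i][c], 1e-9))
                smooth = sum(w for (a, b), w in edges.items()
                             if (a == i and labels[b] != c)
                             or (b == i and labels[a] != c))
                return data + lam * smooth
            best = min(range(n_classes), key=energy)
            if best != labels[i]:
                labels[i], changed = best, True
        if not changed:
            break
    return labels
```

On a three-node chain where the middle node is weakly mislabeled, the smoothness term flips it to agree with its confident neighbors, which is exactly the label-noise cleanup the section describes; graph cuts reach a comparable (often better) minimum in far fewer sweeps on large scenes.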
The initial labels are adjusted by the α-expansion algorithm [37], which mainly merges wrongly labeled categories into the majority of the surrounding classes, thereby reducing the inconsistency of the local classification. The minimum of the energy function is found by a graph-cut algorithm to obtain the optimized classification result. To compare the effect of the optimized neighborhood system, this paper also runs the optimization under an ordinary neighborhood system and compares the two classification results, as shown in Figure 8. The result based on K-nearest neighbors (Figure 8a) shows that neighboring objects can easily cause error propagation in occluded areas, because that method only considers spatial distance and ignores the similarity between different types of objects. Because the optimized neighborhood system considers the similarity of local clusters, its classification results effectively avoid the propagation of optimization errors at intersections.
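The energy that the graph-cut / α-expansion step minimizes can be evaluated as sketched below, a minimal illustration assuming a standard unary-plus-Potts formulation (negative log-likelihoods from the classifier as the data term, weighted label-disagreement as the smoothness term); the function name and the toy numbers are our own.

```python
import numpy as np

def mrf_energy(unary, labels, edges, weights, lam=1.5):
    """Total MRF labeling energy: data term plus weighted Potts smoothness.

    unary   : (N, C) per-cluster class probabilities from the classifier
    labels  : (N,) candidate label assignment being evaluated
    edges   : list of (i, j) adjacency pairs; weights: their edge weights
    lam     : balance coefficient (1.5 performed best on Dataset A)
    """
    # Data term: negative log-likelihood of the label assigned to each cluster
    data = -np.log(unary[np.arange(len(labels)), labels] + 1e-9).sum()
    # Smoothness (Potts) term: penalize label disagreement across edges
    smooth = sum(w for (i, j), w in zip(edges, weights) if labels[i] != labels[j])
    return data + lam * smooth

# A noisy middle cluster: smoothing makes the uniform labeling cheaper overall
unary = np.array([[0.9, 0.1], [0.6, 0.4], [0.9, 0.1]])
edges, w = [(0, 1), (1, 2)], [1.0, 1.0]
e_uniform = mrf_energy(unary, np.array([0, 0, 0]), edges, w)
e_mixed = mrf_energy(unary, np.array([0, 1, 0]), edges, w)
print(e_uniform < e_mixed)  # True: the isolated label flip is corrected
```

α-expansion repeatedly proposes relabelings that lower exactly this quantity, which is how isolated misclassifications get absorbed into the surrounding majority class.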

Building Extraction Based on Semantic Labels
The point clouds labeled as buildings are extracted from the classification result of the scene. To obtain complete and independent building objects, clusters are merged into single objects according to their connectivity relationships, and small clusters with fewer than 20 points are deleted to filter out points misclassified as buildings. The specific process is as follows: (1) extract the point set C marked as building from the scene classification results; (2) select a cluster C i from C and obtain candidate neighbors C j based on a 4-NN search; iterate over all candidate clusters and determine whether the distance between C i and each neighbor C j satisfies Equation (8); if it does, the two clusters are merged and C j is marked as clustered, otherwise C j is deleted from the candidate set; (3) if no new clusters are added, a building object has been formed; (4) repeat this process until all clusters are marked as completed. The buildings are eventually extracted.
where D threshold = |C i | + |C j | indicates the combined size of the two clusters.
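The four-step merging procedure above can be sketched with a union-find pass over the cluster centroids, shown below. The fixed `dist_threshold` is a hypothetical stand-in for the size-dependent threshold of Equation (8), which is not reproduced in this extract; function and variable names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def merge_building_clusters(centroids, sizes, k=4, dist_threshold=2.0,
                            min_points=20):
    """Merge building-labeled clusters into objects via 4-NN connectivity.

    centroids : (N, 3) cluster centroids; sizes : (N,) point count per cluster.
    Returns an object id per cluster; -1 marks objects filtered out for
    having fewer than min_points points in total.
    """
    parent = list(range(len(centroids)))

    def find(x):                      # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = cKDTree(centroids)
    dists, idxs = tree.query(centroids, k=min(k + 1, len(centroids)))
    for i, (drow, jrow) in enumerate(zip(dists, idxs)):
        for d, j in zip(drow[1:], jrow[1:]):   # skip self (first neighbor)
            if d <= dist_threshold:
                parent[find(i)] = find(int(j))  # same building object

    labels = np.array([find(i) for i in range(len(centroids))])
    for root in np.unique(labels):             # drop undersized objects
        if sizes[labels == root].sum() < min_points:
            labels[labels == root] = -1
    return labels

cents = np.array([[0, 0, 0], [1, 0, 0], [10, 0, 0], [11, 0, 0], [50, 0, 0]],
                 dtype=float)
sizes = np.array([30, 30, 30, 30, 5])
labels = merge_building_clusters(cents, sizes)
print(labels)  # two building objects; the isolated 5-point cluster is -1
```

Clusters 0-1 and 2-3 fall within the threshold of each other and merge into two objects, while the far-away 5-point cluster fails the 20-point filter and is discarded.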

Results
LiDAR data from three different complex and challenging scenes are used for qualitative and quantitative evaluation to verify the performance of the proposed framework and optimization method. In this section, the experimental datasets are first introduced; the proposed method is then validated in experimental studies, and the results on these datasets are presented and analyzed.

Experimental Data Description
To evaluate the performance of the presented framework on LiDAR point clouds, we performed both qualitative and quantitative evaluations on three different datasets. The point clouds in dataset A are part of an urban scene in Hengdian, Zhejiang, China, collected with the SSW-MMTS mobile mapping system. As described in [13,38], the SSW-MMTS mobile mapping system integrates a laser scanner with a maximum range of 300 m, a navigation and positioning system, and six high-resolution digital cameras (22 million pixels each), installed on the roof of a minivan. The point density in this area is about 77 points/m 2 . Dataset B was captured around urban and rural outdoor scenes in Zurich, Switzerland, with 30 static terrestrial laser scanners; [21,39] describe this large-scale 3D outdoor benchmark, which contains about 600 million 3D points with varying densities, colorized from camera images. Dataset C [40] comprises about 8 million points acquired around the Wuhan University campus in Wuhan, Hubei, China, using a SICK LMS291 laser scanner; it has a low point density and belongs to low-resolution laser scanning data. The point clouds of dataset C lack color information because no digital cameras were used, and the number of points is considerably smaller than in datasets A and B. In all three datasets, many objects are incomplete due to mutual occlusion, which makes them extremely challenging. Our team and other collaborators carefully labeled all points with the CloudCompare (http://www.cloudcompare.org/) tool to evaluate the performance of the proposed framework. Each dataset is divided into training samples for the learning procedure and testing samples for evaluating the proposed methods.

Preliminary Results of Semantic Labeling Using Contexts and MRF-Based Optimization
During the learning phase, manually labeled points are used as input to train the RF classifier at each iteration. The number of decision trees and the depth of each tree in the RF are set to 100 and 15, respectively. The initial scene semantic labeling and its comparison with the spatially smoothed results for buildings (rendered in yellow) on selected test point clouds are shown in Figure 9. Figure 9a shows the ground truths, colored according to the label of each point. The initial semantic segmentation results of candidate objects are provided in Figure 9b, in which a small portion of points are mislabeled owing to local feature similarities, e.g., incomplete building façades incorrectly identified as trees. Figure 9c shows the results of MRF classification optimization based on ordinary K-nearest neighbors. In Figure 9d, the spatially smoothed results of MRF classification based on the optimized neighborhood are given.
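The learning stage with the stated hyperparameters (100 trees, depth 15) can be sketched with Scikit-learn, which the paper reports using. The synthetic feature matrix, the feature count, and the class set below are illustrative placeholders, not the paper's actual training data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative stand-in for per-super-point feature vectors and labels
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 12))        # 12 geometric features (assumed)
y_train = rng.integers(0, 4, size=500)      # e.g. ground/building/tree/other

# RF configuration from the paper: 100 decision trees, max depth 15
rf = RandomForestClassifier(n_estimators=100, max_depth=15, random_state=0)
rf.fit(X_train, y_train)

# Per-class probabilities later serve as the unary (data) term of the MRF
proba = rf.predict_proba(X_train[:5])
print(proba.shape)  # (5, 4)
```

Using `predict_proba` rather than hard labels is what allows the subsequent MRF stage to weigh the classifier's confidence against spatial smoothness.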

Classification-Based Extraction of Buildings
After the building points are labeled, a classification-based segmentation is performed to extract all the buildings. An illustration of building segmentation is given in Figure 10, taking the urban environment as an example; the proposed method works effectively for extracting building objects. However, due to the complexity of the datasets, some points are difficult to distinguish and are mistakenly segmented. Figures 11-13 show the results of building extraction on datasets A, B and C.


Experimental Analysis
To quantitatively evaluate the performance of the proposed method for semantic labeling and building recognition on the three datasets, four evaluation indexes were adopted in this study. Recall represents the percentage of completeness, while precision represents the percentage of exactness. The overall accuracy (OA) reflects the overall performance on the test set, and the F1 score was used to evaluate the classification performance on each single class. They are defined as follows:

Precision = TP / (TP + FP) (9)

Recall = TP / (TP + FN) (10)

OA = (TP + TN) / (TP + TN + FP + FN) (11)

F1 = 2 × Precision × Recall / (Precision + Recall) (12)
where TP (true positive) denotes the number of objects labeled with the correct class; FP (false positive) represents the number of objects that are recognized but not in the corresponding reference set; FN (false negative) is the number of incorrectly classified objects; and TN (true negative) is the number of negative samples correctly classified as negative [41]. Tables 2 and 3 show the quantitative results for K-nearest neighbors and optimal neighborhoods on the three datasets. The classification accuracy of building objects under optimal-neighborhood-based optimization is approximately 0.8%, 1.1% and 1.4% higher than that based on K-nearest neighbors, respectively. This further illustrates that the optimal neighborhood system can handle incompleteness and occlusion by considering long-range contexts. Since buildings occupy a large proportion of the entire scene, the improvement in building classification accuracy has a significant impact on the classification accuracy of the scene. The proposed method (optimization based on the optimal neighborhood) achieves good performance, with overall accuracies of 95.9%, 94.3% and 84.7% on the three datasets, respectively. In particular, the building classification results are satisfactory.
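The four evaluation indexes follow directly from the confusion counts defined above; the short sketch below computes them, with the example counts being illustrative values rather than figures from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and overall accuracy from confusion counts."""
    precision = tp / (tp + fp)                 # exactness
    recall = tp / (tp + fn)                    # completeness
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / (tp + fp + fn + tn)       # overall accuracy
    return precision, recall, f1, oa

# e.g. 90 correct building objects, 10 false alarms, 5 misses, 95 true rejections
p, r, f1, oa = classification_metrics(90, 10, 5, 95)
print(round(p, 3), round(r, 3), round(f1, 3), round(oa, 3))  # 0.9 0.947 0.923 0.925
```

Note that OA can stay high even when a minority class is poorly recognized, which is why the per-class F1 score is reported alongside it.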

Major parts of our method were implemented in C++, except that the semantic labeling and building extraction stages were implemented in Python. The Point Cloud Library [42], OpenCV [43] and Scikit-learn [44] are used in our program. Table 4 lists the processing time of each stage of our method. These results show that most of the total time is spent on super-point generation for each dataset, because this step is a point-based process. On the positive side, the efficiency of all subsequent processing is greatly improved because super-points are used as the basic units.

Comparative Studies
To further demonstrate the superiority of the proposed method, it is compared with the previous studies of [13,40,45] in terms of the overall accuracy of semantic labeling for the entire scene; building extraction in particular is highlighted, as listed in Table 5. On dataset A, we compared our semantic labeling and building extraction results with other recent methods (Yang et al. [13] and Zhang et al. [45]); overall, the accuracy of our method reaches 95.9%, which is considerably higher than that of the latter. On dataset B, we compared the proposed method with the same two works, Yang et al. [13] and Zhang et al. [45]; the proposed method again achieved the highest semantic segmentation performance, with a classification accuracy of 95.4%, slightly higher than the results of the other two methods. For all of the compared methods, building extraction obtained satisfactory results. For point cloud classification on dataset C, other methods, including Yang et al. [13] and Wang et al. [40], are compared with the proposed method; it is noted that our method achieves the best results in object recognition and building extraction.

Conclusions
This paper has presented a method for effectively conducting semantic labeling and building extraction from LiDAR point clouds, comprising: (1) separating ground and non-ground points using an advanced existing filtering approach; (2) generating spatially consistent super-points, rather than individual points, from the non-ground points; (3) extracting different features based on the super-point neighborhoods, then selecting optimal features and using them for point classification; (4) obtaining the initial semantic labeling results using the random forest classifier and refining them based on the optimized neighborhood by considering more context; and (5) extracting buildings according to the semantic labeling results. The main contributions of the proposed approach are as follows: non-ground points are over-segmented into super-points to improve both the estimation of local geometric features of neighboring points and the segmentation efficiency; local feature sets are selected for semantic segmentation to remove features with weak classification ability and achieve the best building extraction performance; an MRF model introducing high-order contextual information is designed for classification refinement; and the hierarchical segmentation strategy is robust to noise, occlusion and overlapping. Experiments on three different datasets show that the method has good applicability for building extraction from point clouds in complex environments.
Future work will address the following aspects: effectively reducing the number of manually set parameters in the proposed model to further strengthen its generalization ability; generating multiscale super-points to better preserve boundaries and small structures while cutting down time cost; considering more multi-level and contextual features to enhance descriptiveness; using a higher-order MRF model to optimize the semantic segmentation results by taking into account long-range contexts among local variables; and extracting buildings directly from the scene and performing instance segmentation.