1. Introduction
In the processing of digital terrain models (DTMs) and 3D city and landscape models, point clouds have become an increasingly popular type of data. In photogrammetry, point clouds are commonly produced by airborne laser scanning (ALS) [1,2] and by dense matching of aerial photographs [3]. Regardless of the acquisition method, point cloud classification cannot be ignored: it is the first step in extracting useful geo-information. In some applications, such as DTM generation, points only need to be classified into two classes; in others, such as city reconstruction, points must be classified into multiple categories. Some existing classification approaches are point-based, while others are segment-based [4]. Point-based methods obtain accurate classification results by using information derived from each point and its neighborhood, such as eigenvalue-based features, point density values, and the direction of the normal vector, or information based on the point itself, such as the intensity value and echo-based features. In contrast, segment-based methods first divide the point cloud into segments and then assign a class label to each segment, so that all points within a segment belong to the same category.
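To make the point-based features above concrete, the sketch below computes the common eigenvalue-based descriptors (linearity, planarity, sphericity) from the covariance matrix of each point's k-nearest-neighbor set. The neighborhood size and the exact feature set are illustrative assumptions, not the configuration used in this paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def eigen_features(points, k=20):
    """Compute eigenvalue-based features for every point of an (N, 3) array.

    Returns an (N, 3) array with linearity, planarity and sphericity,
    derived from the sorted eigenvalues of the local covariance matrix.
    """
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)              # k nearest neighbors per point
    feats = np.zeros((len(points), 3))
    for i, neighbors in enumerate(idx):
        cov = np.cov(points[neighbors].T)         # 3x3 covariance of the neighborhood
        ev = np.sort(np.linalg.eigvalsh(cov))[::-1]   # l1 >= l2 >= l3
        l1, l2, l3 = np.maximum(ev, 1e-12)
        feats[i] = [(l1 - l2) / l1,               # linearity
                    (l2 - l3) / l1,               # planarity
                    l3 / l1]                      # sphericity
    return feats
```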
Segment-based classification methods outperform point-based methods in several respects. First, segment-based methods save time: the number of segments is much smaller than the number of points. Although dividing the point cloud into segments initially takes some time, a segment-based method assigns a single label to each segment, which may contain hundreds of points, so the time spent on feature generation and labeling is reduced. Second, segments offer additional features beyond those available for a single point and its local neighborhood, such as segment size, segment point density, and the average echo number within a segment. These features may improve the separability of the categories.
The advantages of segment-based classification cannot be realized without good segmentation. Under- and over-segmentation errors negatively affect the classification accuracy [5]. Under-segmentation inevitably causes classification errors, as all the points in a segment receive the same label. Over-segmentation, in turn, adds computational effort and reduces the reliability of the segment-based features.
In this paper, a segment-based method is used to reduce the computational burden of our previous work [6], which is a point-based convolutional neural network labeling method. The scientific contributions of this study are as follows:
- We propose a three-step region-growing segmentation method for segment-based classification. The segmentation is divided into three steps so that each step provides a good starting point for the following procedure (a minimal region-growing sketch is given after this list).
- We also develop our convolutional neural network further: a multi-scale convolutional neural network is trained to automatically learn deep features of each point from feature images generated at multiple scales.
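As referenced in the first contribution above, the following is a minimal, generic region-growing sketch that grows segments from seed points as long as a normal-angle criterion is satisfied. It is a single-criterion illustration only; the three-step method proposed in this paper applies different criteria in each step, and the neighborhood size and angle threshold used here are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def region_grow(points, normals, k=10, angle_thresh_deg=10.0):
    """Generic region growing: grow a segment from each unlabeled seed point
    while neighboring normals deviate by less than the angle threshold.
    `points` is (N, 3); `normals` is (N, 3) with unit normals precomputed.
    """
    tree = cKDTree(points)
    _, knn = tree.query(points, k=k)
    labels = np.full(len(points), -1, dtype=int)
    cos_thresh = np.cos(np.radians(angle_thresh_deg))
    seg_id = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        labels[seed] = seg_id
        stack = [seed]
        while stack:
            p = stack.pop()
            for q in knn[p]:
                # Accept the neighbor if its normal is close to the current point's normal.
                if labels[q] == -1 and abs(np.dot(normals[p], normals[q])) > cos_thresh:
                    labels[q] = seg_id
                    stack.append(q)
        seg_id += 1
    return labels
```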
The paper is structured as follows: related work is discussed in Section 2. We present our methodology in Section 3. Section 4 presents the experimental results, in which our method is compared with state-of-the-art segment-based and point-based methods. A discussion of the experiments is given in Section 5. We provide concluding remarks and suggestions for future work in Section 6.
2. Related Work
Several modern discriminative methods have been proposed to learn the relationship between data and labels. AdaBoost [7], support vector machines (SVMs) [8], random forests (RFs) [9], conditional random fields (CRFs) [10], and deep convolutional neural networks (DCNNs) [11] are popular examples, and they have also been applied to ALS data. Classification methods for ALS data can be divided into two categories: point-based classification and segment-based classification [5].
For point-based classification, an AdaBoost algorithm [12], which automatically combines rough guesses into a more accurate hypothesis, was used to label 3D ALS data into four classes; five features were used in the classification. An SVM classifier was used in Mallet's work [13] as a point-based method for LiDAR data. The SVM is a non-parametric method that performs well, especially on non-linearly separable data, and it exploits the potential of LiDAR data. Chehata et al. [14] used the RF method to classify LiDAR data. Random forests can make full use of the multi-echo and full-waveform LiDAR features and provide accurate classification results efficiently, even for large datasets. For a thorough discussion of supervised classifiers, Weinmann et al. [15] applied ten different methods within the same procedure to evaluate their performance. All of these methods treat each point independently. In more complex scenes, such as urban areas, this drawback may lead to inhomogeneous results, as mentioned in Niemeyer's work [16]. In urban areas, many different objects appear even in a small scene. Roofs and other challenging objects, such as cars, fences, and hedges, may have many details, causing overlapping feature distributions between classes. Errors such as shadows cast by other objects, missing data, and random errors make the problem worse.
To overcome these problems, contextual information, which captures the relationships between 3D points within a neighborhood, has been introduced into the classification of ALS data. The relationships between object classes can be learned to improve the results: for example, a facade is more likely to appear next to a roof, and a fence is more likely to appear on top of grass. Probabilistic graphical models, such as the conditional random field, are used for this reason. Niemeyer et al. [17] presented a point-based CRF classifier for urban ALS data, using a graphical model to represent the point cloud in which the edges link each point to its 2D neighbors. The relationships between the object categories and the data are learned in a training step using a complex model. Compared with methods that use no contextual information, the CRF achieves smoother and more accurate results, even for classes with few instances, such as garages and pavilions.
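For reference, such point-based CRFs are usually written as an energy over a graph G = (V, E) whose nodes are the points and whose edges connect neighboring points; the following is the standard pairwise form rather than the exact model of [17]:

E(\mathbf{y} \mid \mathbf{x}) = \sum_{i \in V} \varphi_i(y_i, \mathbf{x}) + \sum_{(i,j) \in E} \psi_{ij}(y_i, y_j, \mathbf{x}),

where the unary potential \varphi_i scores the assignment of label y_i to point i given the data \mathbf{x}, and the pairwise potential \psi_{ij} encodes the contextual relationship between neighboring points. The most probable labeling is the one that minimizes this energy.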
The pairwise CRF also has some problems. The interactions occur only at a very local level, so isolated clusters of points may be assigned to the wrong classes. Many researchers have extended the CRF to handle these missing long-range interactions. Luo and Sohn [18] presented a multi-range and asymmetric conditional random field (maCRF), in which prior information on scene-layout compatibility is used to handle the long-range dependency problem. The maCRF combines two CRF models: a short-range CRF for the local neighborhood and a long-range CRF for long-range interactions. The final results are refined using the outputs of the two independently applied models. Another solution was proposed by Xiong et al. [19], who used a multi-stage inference procedure to handle the difficulty of modeling the contextual relationships between 3D points. A segment result is first obtained from a point-based classification; the contextual information derived from this segmentation is then used for the final point-based classification. This P^N Potts model was proposed by Kohli et al. [20]. The mutually propagating and iterating contextual information improves the classification results. The P^N Potts model restricts local spatial interactions, and at a larger scale some potential misclassifications may be corrected.
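In its standard form, the P^N Potts higher-order potential over a clique (e.g., a segment) c is

\psi_c(\mathbf{y}_c) = \begin{cases} \gamma_k & \text{if } y_i = l_k \;\; \forall i \in c, \\ \gamma_{\max} & \text{otherwise}, \end{cases} \qquad \gamma_k \le \gamma_{\max},

so configurations in which all points of a clique share the same label are cheaper than mixed configurations; this is what allows segment-level consistency to correct isolated misclassifications.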
Convolutional neural networks also take contextual information into consideration in point-based classification tasks. Boulch et al. [21] picked several snapshots of the point cloud and generated an RGB and geometric composite image for each snapshot, thereby transforming the 3D data into 2D images. A fully convolutional neural network was trained on these images and used for pixel-wise labeling. Caltagirone et al. [22] applied a simple and fast fully convolutional neural network (FCN) to road detection. Top-view images encoding several basic statistics, such as mean elevation and density, were generated. The FCN is specifically designed for pixel-wise semantic segmentation, combining a large receptive field with high-resolution feature maps. Yousefhussien et al. [23] presented a 1D FCN that generates point-based labels while implicitly learning contextual features in an end-to-end fashion. Yang et al. [6] presented point-based feature image generation for the CNN: for each point in the ALS data, the neighboring points within a window were extracted, and feature images containing the contextual information were generated from their point-based features. The relationships between each point and its feature image were learned by the CNN model.
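The general idea of turning a point's neighborhood into a feature image, as used in [6], can be sketched as follows. The window size, grid resolution, and per-point attributes are illustrative assumptions rather than the exact settings of that work.

```python
import numpy as np

def feature_image(points, features, center, window=10.0, grid=32):
    """Rasterize the neighborhood of `center` into a (grid, grid, C) feature image.

    `points`   : (N, 3) coordinates
    `features` : (N, C) per-point attributes (e.g., height, intensity, planarity)
    `center`   : (3,) the point for which the image is generated
    """
    half = window / 2.0
    # Keep only points inside the horizontal window around the center point.
    mask = (np.abs(points[:, 0] - center[0]) < half) & \
           (np.abs(points[:, 1] - center[1]) < half)
    local, vals = points[mask], features[mask]
    img = np.zeros((grid, grid, features.shape[1]))
    cnt = np.zeros((grid, grid, 1))
    # Map each neighbor to a pixel and average the features per pixel.
    cols = ((local[:, 0] - center[0] + half) / window * grid).astype(int).clip(0, grid - 1)
    rows = ((local[:, 1] - center[1] + half) / window * grid).astype(int).clip(0, grid - 1)
    for r, c, v in zip(rows, cols, vals):
        img[r, c] += v
        cnt[r, c] += 1
    return img / np.maximum(cnt, 1)
```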
Another way to improve the labeling results and reduce the time cost is segment-based classification. More stable features can be obtained, since segments provide additional features compared with a single point and its local neighborhood. Furthermore, the number of segments is much smaller than the number of points, so time is saved even though a segmentation process is added. Golovinskiy et al. [24] presented a system for detecting objects such as traffic lights or cars using combined terrestrial and ALS data. First, potential object locations were determined with a hierarchical clustering method. Then, a graph-cut-based segmentation was applied to classify the points close to these locations into foreground and background; this segmentation required the segment parameters, such as the maximum radius, to be set in advance. The points in the foreground segments were treated as objects, while the points in the background segments were discarded. Feature vectors were calculated based on context and shape information and passed to a classifier. Shapovalov et al. [25] used the k-means method to perform the segmentation. Each point was treated as a leaf of a tree, and a heuristic method was used to reduce the computational burden. Since the k-means method only fixes the total number of segments, it leads to strong over-segmentation. A graph over the medoids of the segments was built, and the edge values were determined by analyzing the k-nearest neighbors of the medoids. A naïve Bayes classifier was used to define the pairwise potentials, considering features such as the deviation of the segments' surface normals and the geometrical arrangement of the medoids. The experimental results showed that the segment-based method can remove noise, increase efficiency, and make use of natural edge features. In Xu's work [4], single points and two types of segments obtained by different methods were treated as entities for the classification. Features such as the z variance, distance ratio, segment size, and normal direction were calculated for these three kinds of entities. The classification was based on heuristic rules, and the contextual information was considered using the segment-based methods. Niemeyer et al. [26] merged spatial and semantic context in a two-layer CRF. The output of the first CRF was used to generate segments; these segments contained larger-scale context and were introduced as an energy term in the next iteration of the second CRF layer. Guinard and Landrieu [27] proposed a non-parametric segmentation model for the classification of 3D LiDAR point clouds in urban areas. The high-level structure of the scene was captured by integrating the segmentation into the CRF. The segment-based method aggregated the noisy predictions of a weakly supervised classifier and produced more accurate results. Vosselman et al. [5] argued that different segmentation methods may be suited to different object classes; a hierarchical structure containing two different segmentation methods was therefore proposed to obtain a generic technique for ALS data. This structure is capable of handling complex urban areas with a large number of categories. The combination of small and large segments produced by the hierarchical structure makes interactions between both nearby and distant points possible. The contextual information was learned using a CRF, with the edge values of the graph defined by the segment boundaries rather than by the medoids [25]. Features extracted by analyzing the segment boundaries were added to improve the classification accuracy.
This paper is based on our previous work [6]. We change the point-based method to a segment-based method: a three-step region-growing method is proposed for the segmentation, feature images at different scales are generated and used as the input of a multi-scale convolutional neural network, and the CNN model is trained for the final semantic labeling task.
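A minimal sketch of such a multi-scale network, written with PyTorch, is given below. The number of scales, channel counts, and layer sizes are assumptions for illustration and do not reproduce the exact architecture used in this paper.

```python
import torch
import torch.nn as nn

class MultiScaleCNN(nn.Module):
    """Three parallel branches process feature images of the same point at
    three scales; their outputs are concatenated and classified jointly."""

    def __init__(self, in_channels=3, num_classes=9):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4),          # 32 x 4 x 4 per branch
            )
        self.branches = nn.ModuleList([branch() for _ in range(3)])
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 32 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, imgs_small, imgs_medium, imgs_large):
        feats = [b(x) for b, x in zip(self.branches, (imgs_small, imgs_medium, imgs_large))]
        return self.classifier(torch.cat(feats, dim=1))
```

Each branch receives the feature image of the same point generated at one scale, and the concatenated branch outputs are classified jointly, so coarse context and fine detail contribute to the same prediction.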
5. Discussion
The original purpose of our proposed method was to reduce the computational burden of the point-based method in our previous work [6]. As shown in Table 3 and Table 4, the three-step region-growing segmentation strategy performs well in combination with the MCNN. Although the MCNN requires more feature images, it improves the overall accuracy and the average F1 score of the classification result (e.g., comparing Point_S with Point_M and SegT_S with SegT_M). The time efficiency of the framework is closely related to the number of test feature images, and the segmentation-based strategies reduce this number considerably: the total number of test feature images for SegT_M was only one tenth that of Point_M and one fifth that of SegS_M. Although the segmentation itself takes some time, the testing efficiency of the framework is clearly improved, while the overall accuracy and the average F1 score remain good.
Compared with Point_S [6], our method shows better classification performance. For large planar objects, such as low vegetation, impervious surfaces, and roofs, the confusion between different categories was resolved. As shown in Figure 7, since the planarity of low vegetation and impervious surfaces differs only slightly, these classes may be mixed together by a point-based classification method. Based on the intensity values and the normal angle differences, the points of these objects were clustered together in our segmentation. Compared with Point_S, our method (SegT_M) improved the results for low vegetation (+3.3%) and impervious surfaces (+0.5%). For small non-planar objects, such as fences, hedges, and cars, our method also achieves better classification results than the point-based method. As shown in Figure 8 and Figure 9, the point-based method may misclassify fences, hedges, and cars as shrubs and other classes. In the second and third steps of our segmentation, planarity values were used to cluster these objects, and each whole segment then served as a single entity in the following classification procedure. Compared with Point_S, our method (SegT_M) improved the results for the fence/hedge (+27.8%) and car (+9.6%) categories. In some areas, as shown in Figure 10, trees and roofs are hard to distinguish because of the unusual distribution of their points and the similarity of their planarity. The segmentation-based strategy can solve this problem to some degree: compared with Point_S, our method (SegT_M) improved the classification of the tree category (+5.2%).
As shown in Table 5 and Table 6, our method performed satisfactorily compared with all participants in the ISPRS WG II/4 Vaihingen 3D Semantic Labeling task: its overall accuracy and average F1 score ranked 1st among all participants. Based on our three-step segmentation strategy, planar objects, such as low vegetation (ranked 1st), impervious surfaces (2nd), and roofs (1st), as well as smaller objects, such as cars (1st), fences/hedges (1st), and shrubs (1st), achieved good F1 scores. The multi-scale convolutional neural network exploited the potential of the selected features, as we expected. There were also some misclassifications in our final result. As shown in Figure 11, shrubs and low vegetation were difficult to distinguish, and some shrubs were confused with trees and low vegetation, because these points were clustered into the same segment during segmentation. Our method requires several parameters to be adjusted in the segmentation step, and it is hard to make a single segmentation result suit all categories, so a more automatic and universal segmentation method should be developed. Furthermore, only LiDAR data were used in our experiments; the corresponding orthoimages could be used in future work to further improve the classification performance.