MSNet: Multi-Scale Convolutional Network for Point Cloud Classification

Abstract: Point cloud classification is quite challenging due to the influence of noise, occlusion, and the variety of types and sizes of objects. Currently, most methods mainly focus on subjectively designing and extracting features. However, such features rely on prior knowledge, and it is difficult to accurately characterize the complex objects of point clouds with them. In this paper, we propose a concise multi-scale convolutional network (MSNet) for adaptive and robust point cloud classification. Both the local features and the global context are incorporated for this purpose. First, around each point, the spatial contexts of different sizes are partitioned as voxels of different scales. A voxel-based MSNet is then simultaneously applied at multiple scales to adaptively learn discriminative local features. The class probability of a point cloud is predicted by fusing the features together across multiple scales. Finally, the predicted class probabilities of MSNet are optimized globally using a conditional random field (CRF) with a spatial consistency constraint. The proposed method was tested with data sets of mobile laser scanning (MLS), terrestrial laser scanning (TLS), and airborne laser scanning (ALS) point clouds. The experimental results show that the proposed method was able to achieve appreciable classification accuracies of 83.18%, 98.24%, and 97.02% on the MLS, TLS, and ALS data sets, respectively. The results also demonstrate that the proposed network has a strong generalization capability for classifying different kinds of point clouds under complex urban environments.


Introduction
Point clouds are widely available now due to the progressive development of various laser sensors and dense image matching techniques. The efficient classification of point clouds is one of the fundamental problems in scene understanding for three-dimensional (3D) digital cities, intelligent robots, and unmanned vehicles. However, classifying point clouds under complex urban environments is not a trivial task, since they are usually noisy, sparse, and unorganized [1]. The density of point clouds varies with the sampling intervals and ranges of laser scanners. Moreover, severe occlusions between objects during scanning can lead to incomplete coverage of object surfaces. These problems present challenges for point cloud classification.
Point cloud classification can be accomplished via various approaches, such as region growing, energy minimization, and machine learning; a review of these approaches can be found in Nguyen and Le. One way to apply a convolutional neural network (CNN) to a point cloud is to project it into 2D imagery; however, due to the occlusion among objects, directly projecting the point cloud to imagery inevitably loses the depth and 3D spatial structure information. Another way to apply a CNN to a point cloud is by voxelizing the entire unorganized 3D scene into a regular 3D array. This would allow using classical image semantic labeling networks such as FCN [35], SegNet [36], Deeplab [37], and DeconvNet [38] for point cloud classification by extending 2D convolution kernels into 3D ones. However, a point cloud is not actually the "3D data" of whole solid objects; rather, it is a recording of the objects' surfaces, which is a manifold embedded in 3D space. Apart from the objects' surfaces, the 3D space is filled with enormous amounts of null data, so simply voxelizing the entire 3D point cloud into a regular 3D array can lead to a huge amount of unnecessary computation. Therefore, efficient networks constructed directly on points or voxels have increasingly become a topic of interest in recent studies [39][40][41].
One of our arguments is that objects of various sizes exist in a point cloud. For objects that are small in size, a fine scale within a small neighboring region is enough, while large objects require a coarse scale containing a large region. To adapt to the varying sizes of objects, multi-scale voxelization is proposed in this paper to "observe" small neighborhoods finely and large neighborhoods coarsely. Instead of dividing the whole space into voxels of a fixed size, multi-scale voxelization divides a point cloud into voxels at multiple scales, thereby allowing the multi-scale features of objects of various sizes to be extracted based on those voxels. Also, the spatial context information at different scales is integrated during multi-scale feature extraction.
Based on multi-scale voxelization, we further propose a multi-scale convolutional network (MSNet) with the aim of efficient feature learning and class prediction. In our method, only the position information (x, y, and z coordinates) of a point cloud is considered, as intensity or RGB information is not always available. Operating on the neighboring voxels around a point, MSNet learns discriminative features of the local context. The multi-scale features of different spatial resolutions are learned with convolutional networks and fused across different scales. Meanwhile, as a result of multi-scale voxelization, the spatial contexts of different sizes are captured at different scales by the 3D convolution kernels of MSNet. With this strategy, the conventional pooling operation is not necessary for robustly capturing multi-scale features, so the structure of MSNet is concise and easy to implement. However, the classification of point clouds with MSNet is at the voxel level, and is inevitably influenced by noise. As such, a conditional random field (CRF) that fully considers the spatial consistency of the point cloud is applied to globally optimize the predicted class probabilities. Our method therefore incorporates both local and global constraints for highly accurate point cloud classification.
The remainder of this paper is structured as follows. Section 2 starts with multi-scale point cloud voxelization. MSNet is then established for the discriminative local feature learning, and global label optimization with CRF is employed. Section 3 starts with the experimental data description, followed by a presentation of the individual and overall test results to demonstrate the solution procedure. Section 4 is a series of discussions where we compare our proposed MSNet with some state-of-the-art methods, and analyze the generalization capability of the proposed approach. Both quantitative and qualitative evaluations are presented. Section 5 consists of our concluding remarks on the properties of MSNet and our proposed future efforts.

Materials and Methods
As depicted in Figure 1, the proposed method consists of two complementary parts. In Part I of Figure 1, a point cloud is represented as multi-scale 3D voxels. Then, a corresponding MSNet is established for discriminative local feature learning to predict a class probability. In Part II of Figure 1, the point cloud is regarded as an edge-weighted graph, and a CRF with spatial consistency constraints is constructed to obtain the global context. Finally, global label optimization is used to combine the local feature and the global context for accurate classification of the point cloud.

Multi-Scale Voxelization
Humans perceive the context of objects in point clouds at multiple scales: the scene at a coarse scale, and then objects, structures, edges, and points at progressively finer scales. This is a multi-scale observation process that combines information across different scales to enable comprehensive judgment. Similarly, to automatically understand point clouds across different scales for discriminative feature learning, the context of a point cloud is analyzed with multi-scale voxels centered at each point, which allows the network to observe closely at a fine scale and take a rough view at a coarse scale.
At each scale, for a given point P(x, y, z), a neighboring cube $[x - 0.5R, x + 0.5R] \times [y - 0.5R, y + 0.5R] \times [z - 0.5R, z + 0.5R]$ is constructed, as shown in Figure 2a. The cube is then subdivided into n × n × n grid voxels [42] as a patch, and the side length of each voxel is r = R/n, where R corresponds to the size of the neighboring region. The smaller r is, the finer the scale. For small objects, a fine scale within a small neighboring region R is enough, whereas large objects require a coarse scale covering a large region. Conversely, observing small objects at a coarse scale omits details, while processing large objects at a fine scale may lead to high noise sensitivity and large unnecessary computation. Therefore, dividing the point cloud at multiple scales is necessary to accommodate the various sizes of objects.
Figure 2. Voxelization of the point cloud at multiple scales. Around point P(x, y, z), the spatial contexts of different sizes are partitioned as voxels of different scales. The points near P(x, y, z) are more important, and are characterized with finer voxels than those far away from it.
According to the aforementioned analysis, we present the design of a multi-scale voxelization frame in this paper. The patch length n is fixed to the same value at all scales, kept as small as possible for computational efficiency. Then, a series of voxel side lengths $\{r_1, r_2, \ldots, r_S\}$ with increasing values is applied. With a larger $r \in \{r_1, r_2, \ldots, r_S\}$, the cube side length R also increases, yielding a coarse view over a large region, and vice versa. Instead of dividing the whole space into voxels of a fixed size, the multi-scale voxelization divides the individual context of the point cloud into voxels at multiple scales, so that the spatial context information of different scales is well represented for each point. As shown in Figure 2b, with an equal patch length n at all of the scales, the spatial contexts of different sizes are obtained by changing the voxel length $r_i$ (i = 1, 2, . . . , S) without compromising computational efficiency.
Without loss of generality, the point density of each voxel, defined as the ratio of the point count within the voxel to its volume, is adopted as the representative value of the voxel. For fine voxels, the commonly used occupancy value (i.e., 1 if there is a point inside the voxel, and 0 otherwise) is reasonable, as the voxel is small and only a few points, or none, lie in it. For coarse voxels, in contrast, the number of points that lie in a voxel may vary greatly, and cannot simply be approximated as 0 or 1. Compared with the occupancy value, the point density characterizes the degree of point occupancy within a voxel, and is invariant to scale.
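To make the voxelization concrete, the following is a minimal NumPy sketch under the definitions above; the function name and the brute-force in-cube masking are illustrative choices, not the authors' implementation.

```python
# A sketch of multi-scale voxelization: for a query point p, each scale s
# covers a cube of side R_s = n * r_s centered at p, subdivided into an
# n x n x n grid whose cells store point density (count / voxel volume).
import numpy as np

def voxelize_multiscale(points, p, voxel_sizes, n=11):
    """Return a list of (n, n, n) density grids, one per scale."""
    patches = []
    for r in voxel_sizes:                       # r_1 < r_2 < ... < r_S
        half = 0.5 * n * r                      # half of the cube side R = n * r
        local = points - p                      # center the neighborhood on p
        mask = np.all(np.abs(local) < half, axis=1)
        # Map each in-cube point to integer voxel indices in [0, n).
        idx = np.floor((local[mask] + half) / r).astype(int)
        idx = np.clip(idx, 0, n - 1)
        grid = np.zeros((n, n, n), dtype=np.float32)
        np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
        patches.append(grid / r**3)             # point count / voxel volume = density
    return patches

# Example: five scales with doubling voxel size, as in Section 3.
rng = np.random.default_rng(0)
cloud = rng.uniform(-5, 5, size=(100000, 3))
patches = voxelize_multiscale(cloud, cloud[0], [0.035, 0.07, 0.14, 0.28, 0.56])
print([g.shape for g in patches])  # [(11, 11, 11)] * 5
```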

Multi-Scale Convolutional Network of a 3D Point Cloud
Based on the multi-scale voxelization, MSNet is proposed for discriminative local feature learning and class probability prediction, as shown in Figure 3. With the multi-scale voxelization of point clouds, the multi-scale features of different spatial resolutions are learned with a series of convolutional networks (ConvNets) with shared weights, and are fused directly across different scales. Due to the multi-scale voxelization, the 3D convolution kernels of MSNet capture spatial contexts of different sizes at different scales. Thus, the cascaded pooling operation is not necessary, and MSNet has a concise structure with fewer model parameters.

Multi-scale Feature Extraction
Many excellent discriminative feature extraction methods have been proposed for point cloud classification [16][17][18]. However, most of them are "knowledge-driven", and are designed subjectively based on prior knowledge. Due to the influence of noise, occlusion, and various types and sizes of objects, these subjectively-designed features are difficult to use for characterizing the objects in a point cloud.
Owing to its convolution and pooling layers, the CNN has recently been shown to have a powerful feature learning capability in the classification and semantic labeling of 2D images [28,29]. The kernels of the convolution layers simulate the receptive fields of human vision, while the pooling layers are applied for dimension reduction and to guarantee invariance to translation, rotation, and scale.
However, it is difficult to directly utilize a conventional CNN for 3D point cloud classification. A CNN needs a regular 2D or 3D array as input, but when a 3D point cloud is simply projected into 2D imagery, it loses its 3D spatial context information. Dividing the point cloud into a regular 3D array of a single resolution cannot adaptively reflect the different sizes of the objects in the point cloud, and also leads to large unnecessary computation on the null values inside objects, even with subsequent pooling operations.
To address these problems, MSNet is proposed based on the 3D multi-scale voxelization of the point cloud. By simultaneously applying ConvNets at multiple scales, the multi-scale contextual features of objects of different sizes in the point cloud are extracted with discriminative feature learning. The ConvNets at different scales operate on spatial contexts of different region sizes, which acts like the cascaded pooling operations of a normal CNN. Therefore, the pooling layer is not necessary in MSNet, and a shallower structure is achieved due to the simultaneous convolution at multiple scales.
At each point (x, y, z) in the 3D scene, we first construct the corresponding multi-scale 3D voxels according to Section 2.1. Denote the patch of scale s ∈ {1, . . . , S} as $V_s \in \mathbb{R}^{n \times n \times n \times q_0}$, where the last dimension of $V_s$ represents the number of features. In this paper, only the voxel's density is considered, leading to $q_0 = 1$. For each scale s, the 3D ConvNet $F_s$ can be described as a sequence of linear transforms and non-linear activations. For the 3D ConvNet $F_s$ with L layers, the m-th output feature map of layer l ∈ {1, 2, . . . , L} can be represented as:

$$H_{s,l}^{m} = Relu\left(\sum_{q=1}^{q_{l-1}} W_{s,l}^{m,q} * H_{s,l-1}^{q} + b_{s,l}^{m}\right), \quad m = 1, 2, \ldots, q_l, \tag{1}$$

where $H_{s,0} = V_s$, $W_{s,l}^{m,q} \in \mathbb{R}^{f_l \times f_l \times f_l}$ is the convolution kernel with a size of $f_l$, $q_l$ is the feature number of hidden layer $H_{s,l}$, $b_{s,l}^{m}$ is the bias, and * represents the 3D convolution operator. $Relu(\cdot) = \max(\cdot, 0)$ is the activation function acting on each element of the input matrix, which introduces non-linearity into the network, reduces the vanishing of gradients, and speeds up training. In addition, the contribution of the neighboring voxels to the center one is similar across different scales, and depends on their spatial relationship. To capture this property and improve the generalization capability of our MSNet, the weights of the ConvNets are shared across different scales, which reduces the number of model parameters and also makes MSNet concise.
The output of the 3D ConvNet $F_s$ is the feature map of its last layer:

$$F_s = H_{s,L}, \tag{2}$$

which is regarded as the feature of point (x, y, z) at scale s ∈ {1, . . . , S}. A detailed convolution process at a single scale is provided in Figure 4.
Finally, the outputs of the S-scale ConvNets are flattened and fused to produce the final feature vector $\bar{F}$, which can be seen as the multi-scale feature around point (x, y, z):

$$\bar{F} = W_f \left[ flatten(F_1); flatten(F_2); \ldots; flatten(F_S) \right] + b_f, \tag{3}$$

where $flatten(\cdot)$ is the flatten function that stretches a matrix into a vector, $W_f$ represents the full-connection parameters, and $b_f$ is the corresponding bias.
Figure 4. The 3D convolutional network (ConvNet) $F_s$ with L = 6 layers. The input of the network is a patch with $q_0$ feature channels. A sequence of convolution kernels is applied for multi-layer feature learning (without padding, for high computational efficiency), and the size of the output at the l-th layer is $n_l = n_{l-1} - f_l + 1$, $l = 1, 2, \ldots, L$, with $n_0 = n$. The final output is a feature vector ($n_6 = 1$); to this end, the kernel size of the last layer equals the output size of the former layer ($f_6 = n_5$).
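As an illustration of the architecture just described, the following is a hedged PyTorch sketch using the layer settings reported in Section 3 (n = 11; kernel sizes 3, 3, 3, 1, 3, 3; no padding; a single density channel, $q_0 = 1$). The channel widths and the hidden size of the fusion layer are assumptions, as the paper does not list them.

```python
# A sketch of MSNet: one ConvNet with weights shared across all S scales,
# whose flattened outputs are fused by a fully connected layer (Eq. (3));
# a final linear layer produces the class scores fed to the softmax of Eq. (4).
import torch
import torch.nn as nn

class MSNet(nn.Module):
    def __init__(self, num_classes, num_scales=5, channels=(16, 16, 32, 32, 64, 64)):
        super().__init__()
        kernels = (3, 3, 3, 1, 3, 3)            # f_1..f_6; 11 -> 9 -> 7 -> 5 -> 5 -> 3 -> 1
        layers, q_in = [], 1                    # q_0 = 1 (density channel only)
        for f, q_out in zip(kernels, channels):
            layers += [nn.Conv3d(q_in, q_out, kernel_size=f), nn.ReLU()]
            q_in = q_out
        self.convnet = nn.Sequential(*layers)   # shared across scales
        self.fuse = nn.Linear(num_scales * channels[-1], 128)  # W_f, b_f (hidden size assumed)
        self.classify = nn.Linear(128, num_classes)            # softmax regression weights

    def forward(self, patches):
        # patches: list of S tensors, each (batch, 1, 11, 11, 11), one per scale.
        feats = [self.convnet(p).flatten(1) for p in patches]  # F_s, each (batch, q_6)
        fused = self.fuse(torch.cat(feats, dim=1))             # multi-scale feature F-bar
        return self.classify(fused)                            # logits; softmax applied later

net = MSNet(num_classes=7)
x = [torch.rand(4, 1, 11, 11, 11) for _ in range(5)]
print(net(x).shape)  # torch.Size([4, 7])
```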

Discriminative Feature Learning
With the fused multi-scale feature $\bar{F}$, our goal is to use it for class probability prediction. To this end, we apply softmax regression to predict the probability distribution $\hat{p}$ over the classes:

$$\hat{p}_{i,k} = \frac{\exp(W_k \bar{F}_i + b_k)}{\sum_{j=1}^{C} \exp(W_j \bar{F}_i + b_j)}, \tag{4}$$

where $\hat{p}_{i,k}$ is the predicted probability that the i-th point belongs to class $c_k \in \mathcal{C}$, $\mathcal{C} = \{c_1, c_2, \ldots, c_C\}$ denotes the set of classes, and C represents the number of classes. Next, we construct the loss function using cross entropy, which depicts the difference between the predicted distribution $\hat{p}_{i,k}$ and the true distribution $p_{i,k}$ of class $c_k$:

$$Loss(\Theta) = -\sum_{i} \sum_{k=1}^{C} p_{i,k} \log \hat{p}_{i,k}, \tag{5}$$

where $\Theta$ collects all of the network parameters, which are learned by minimizing $Loss(\Theta)$ with a batch stochastic gradient descent algorithm. Once the network is trained, the loss function is no longer needed, and the predicted probabilities are used for further class label inference.
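Continuing the previous sketch (reusing `net` and `x` as defined there), a minimal training step could look as follows; PyTorch's `CrossEntropyLoss` combines the softmax of Equation (4) with the cross entropy of Equation (5), and plain SGD stands in for the batch stochastic gradient descent mentioned above, with placeholder hyperparameters.

```python
# One training step for the loss above; net and x come from the MSNet sketch.
criterion = nn.CrossEntropyLoss()   # softmax (Eq. (4)) + cross entropy (Eq. (5))
optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

labels = torch.randint(0, 7, (4,))  # ground-truth classes for the batch
optimizer.zero_grad()
loss = criterion(net(x), labels)
loss.backward()
optimizer.step()
```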

Global Label Optimization with CRF
Point cloud classification must assign each point a unique label that indicates its class. The simplest strategy to this end is to give each voxel the label with the argmax of the predicted probabilities (Equation (4)), and then assign that label to the points in the corresponding voxel. However, such classification results are at the voxel level; they are inevitably influenced by noise and exhibit spatial inconsistency in the label prediction.
To address this issue, we use a CRF model with spatial consistency to globally optimize the class labels of the point cloud. For this purpose, we construct a graph G(V, E) with vertices v ∈ V and edges e ∈ E. Each vertex is associated with a point, and edges are added between each point and its K-nearest points in the point cloud.
Let random variable $X_i$ be the label of point i. Random variable X consists of $X_1, X_2, \ldots, X_N$, where N is the total number of points. We regard the vertices V of the graph G(V, E) as the random variables of the labels (i.e., $V = \{X_1, X_2, \ldots, X_N\}$). We can then constitute the CRF model (P, X) based on the graph G(V, E) of the point cloud, where P is the global observation of G(V, E); $P = \{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_N\}$ corresponds to the predicted class probabilities of the point cloud obtained with MSNet. The posterior probability of the point cloud being assigned label $l = \{l_1, l_2, \ldots, l_N\}$ ($l_i \in \mathcal{C} = \{c_1, c_2, \ldots, c_C\}$) under the global observation P is then represented as:

$$p(X = l \mid P) = \frac{1}{Z(P)} \exp(-E(l \mid P)), \tag{6}$$

where Z(P) is the normalization constant, and the energy of label l can be represented as:

$$E(l \mid P) = \sum_{i \in V} \varphi(\hat{p}_i, l_i) + \sum_{(i,j) \in E} \psi(l_i, l_j). \tag{7}$$

The label l maximizing the posterior probability $p(X = l \mid P)$ is the most appropriate label of the point cloud, and maximizing the posterior probability in Equation (6) is equivalent to minimizing the energy in Equation (7), which leads to a global optimization of the point cloud label.
The data cost term $\varphi(\hat{p}_i, l_i)$ penalizes the disagreement between a point and its assigned label. In this paper, the initial data cost of each point is calculated from its predicted probability in Section 2.2 as a unary term:

$$\varphi(\hat{p}_i, l_i) = \sum_{k=1}^{C} \delta(l_i = c_k)\,(1 - \hat{p}_{i,k}), \tag{8}$$

where $\delta(\cdot)$ is an indicator function. The data cost enforces that the label l stay close to the predicted probability. The smooth cost term $\psi(l_i, l_j)$ penalizes label inconsistency between neighboring points, encouraging neighboring points to take similar labels. In this work, the K-nearest neighboring points are connected with the central point, and the smooth cost is calculated according to the Euclidean distance between the two points:

$$\psi(l_i, l_j) = \delta(l_i \neq l_j)\,\exp(-d_{i,j}), \tag{9}$$

where $d_{i,j}$ is the 3D Euclidean distance between points i and j. The smooth cost constrains the regularity and consistency of label l. Finally, the energy function $E(l \mid P)$ is minimized with the α-expansion algorithm [43][44][45]. A simple diagram of the optimization process is provided in Figure 5. The initial probabilities of each point are pre-predicted with MSNet, as described in Section 2.2. After several iterations, all of the class labels of the point cloud are globally optimized.
Figure 5. The global optimization assigns each point a spatially consistent class label with the conditional random field (CRF). Assuming there are two classes, R and B, the points in (a) show the predicted probability of each class, which provides the initial data cost. Pure red and pure blue points belong entirely to class R and class B, respectively; the color of the other points is a mixture of red and blue according to their probabilities of belonging to each class. After global optimization, each point is assigned a definite class, as shown in (b).
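As a sketch of how the energy of Equations (6)-(9) could be evaluated, the following assumes NumPy and SciPy; the smooth cost mirrors the distance-weighted form reconstructed in Equation (9), and the actual α-expansion minimization would be delegated to a dedicated graph-cut solver, not reproduced here.

```python
# Evaluate the CRF energy E(l | P) for a candidate labeling of the point cloud.
import numpy as np
from scipy.spatial import cKDTree

def crf_energy(points, probs, labels, k=7):
    """points: (N, 3) coordinates; probs: (N, C) MSNet probabilities;
    labels: (N,) integer candidate labeling to be scored."""
    n = len(points)
    # Data cost (Eq. (8)): penalize labels the network deems unlikely.
    data_cost = (1.0 - probs[np.arange(n), labels]).sum()
    # K-nearest-neighbor graph; the first returned neighbor is the point itself.
    dists, nbrs = cKDTree(points).query(points, k=k + 1)
    # Smooth cost (Eq. (9)) summed over the directed K-NN edges.
    disagree = labels[:, None] != labels[nbrs[:, 1:]]
    smooth_cost = (np.exp(-dists[:, 1:]) * disagree).sum()
    return data_cost + smooth_cost
```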

Experimental Data
Both mobile laser scanning (MLS) and airborne laser scanning (ALS) point clouds were used to evaluate the proposed method, and included objects of different sizes and scanning densities. They were acquired from the same area of Wuhan University (WHU), China, and are available at https://github.com/wleigithub/WHU_pointcloud_dataset. An overview of the experimental area is shown in Figure 6, and the experimental data are provided in Figure 7. The properties of these data sets are summarized in Table 1.
The MLS point cloud with two blocks (block I and block II) was obtained with a SICK LMS291 laser range finder in March 2014. The blocks are labeled into seven classes: vegetation (e.g., trees and grass), buildings, cars, pedestrians, lamps, fences, and others. The point density of the MLS point cloud varied considerably with the different distances between the objects and the scanners. Moreover, the point clouds of many objects were often incomplete due to mutual occlusion. The ALS point cloud (block III) was acquired in Wuhan, China in July 2014 by a Y5 plane carrying an H68-18 airborne laser radar system with a mean flight altitude of 800 m above ground. The point density in the experimental area was approximately 5–10 points/m². Compared with the MLS point cloud, the ALS point cloud was sparser and much more fragmented; therefore, only three classes (i.e., vegetation, buildings, and cars) were recognizable in block III. Additionally, ground points, which formed a large portion of the point clouds, were manually removed from the experimental data sets in advance. All of the data sets were divided into training and testing data by a randomly chosen plane. Human operators carefully labeled the data sets with the CloudCompare (http://www.cloudcompare.org/) tool. Figure 7 provides an overview of the three blocks (2,330,834, 5,023,784, and 804,836 points, respectively).

Experimental Results and Assessment
At each point in 3D space, the corresponding multi-scale voxels were constructed among its neighbors. For the training samples, each voxel was assigned a unique label according to the majority of the point labels within it; in the case of a tie, one of the tied labels was chosen at random. Meanwhile, to simulate objects with different orientations, the training samples were randomly rotated around the voxels' central vertical axis, as sketched below. Additionally, k-fold cross validation was used for all of the experiments.
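A minimal sketch of this rotation augmentation, assuming NumPy (the helper name is illustrative):

```python
# Rotate a training neighborhood by a random angle around the vertical (z)
# axis through its center, before voxelization.
import numpy as np

def rotate_around_z(local_points, rng):
    """local_points: (N, 3) coordinates already centered on the sample point."""
    theta = rng.uniform(0.0, 2.0 * np.pi)   # random heading in [0, 2*pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])       # rotation matrix about the z axis
    return local_points @ rot.T
```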
Deeper layers would lead to higher accuracy, but also to more computational expense. Considering both performance and efficiency, L = 6 hidden layers were chosen and applied. In addition, n was fixed at 11 for computational efficiency, with $f_1 = f_2 = f_3 = f_5 = f_6 = 3$ for feature learning and $f_4 = 1$ for feature dimension reduction. In this way, the output of each scale corresponded to a feature vector. Moreover, the seven nearest neighbors of each point were searched to construct the graph G(V, E) for global label optimization.

Classification of Point Clouds
The MLS point cloud in Figure 7a,b contains different objects, such as vegetation, buildings, cars, pedestrians, lamps, fences, and others (sculptures, roadblocks, and trash bins). To classify them, we used S = 5 scales for point cloud voxelization due to the wide range of object sizes, and n was fixed at 11 for computational efficiency. Based on the object sizes and the point cloud density of the space, the side length of the finest scale was set as $r_1$ = 0.035 m (i.e., slightly larger than the average resolution of the point cloud). Additionally, similar to most multi-scale strategies, we set the side length of each scale as two times that of the previous one, i.e., $r_{i+1} = 2r_i$, i = 1, 2, . . . , S − 1. The side lengths of the other four scales were thus derived as $r_2$ = 0.07 m, $r_3$ = 0.14 m, $r_4$ = 0.28 m, and $r_5$ = 0.56 m, respectively, which allowed us to characterize the point cloud at five different scales. At the finest scale, $r_1$ = 0.035 m, the neighboring region was 0.385 × 0.385 × 0.385 m³ and focused on the details; at the coarsest scale, $r_5$ = 0.56 m, the spatial context within the neighborhood was 6.16 × 6.16 × 6.16 m³.
For the ALS point cloud, with its lower density and larger objects, the side length of the finest scale was set as $r_1$ = 0.07 m. As in the experiment on the MLS point cloud, five scales were applied for point cloud voxelization, n was fixed at 11, and the side length of each scale was set as two times that of the previous one. Therefore, the side lengths of the other four scales were $r_2$ = 0.14 m, $r_3$ = 0.28 m, $r_4$ = 0.56 m, and $r_5$ = 1.12 m, and the side lengths of the neighboring regions at each scale were calculated as $R_1$ = 0.77 m, $R_2$ = 1.54 m, $R_3$ = 3.08 m, $R_4$ = 6.16 m, and $R_5$ = 12.32 m, with $R_s = n r_s$, s = 1, 2, . . . , 5.
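As a quick arithmetic check, the doubling rule and $R_s = n r_s$ reproduce the values quoted above for both the MLS and ALS settings:

```python
# Derive the voxel side lengths r_s (doubling rule r_{i+1} = 2 * r_i) and the
# neighborhood side lengths R_s = n * r_s for the two experiments.
n, S = 11, 5
for r1 in (0.035, 0.07):                        # finest scales: MLS, ALS
    r = [r1 * 2 ** i for i in range(S)]
    print([round(ri, 3) for ri in r],           # voxel side lengths r_s
          [round(n * ri, 3) for ri in r])       # region side lengths R_s
# MLS: [0.035, 0.07, 0.14, 0.28, 0.56] -> [0.385, 0.77, 1.54, 3.08, 6.16]
# ALS: [0.07, 0.14, 0.28, 0.56, 1.12]  -> [0.77, 1.54, 3.08, 6.16, 12.32]
```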
The ground truths and the corresponding classification results of the proposed method are shown in the first and last columns of Figure 8. It can be seen from the first two rows of Figure 8 that the MLS point clouds containing vegetation, buildings, cars, lamps, and fences were correctly classified, despite their different sizes and shapes. Due to the multi-scale and discriminative feature extraction capability of the proposed MSNet, the spatial context of the objects in Figure 8 was well characterized. Objects such as lamps in vegetation, cars with incomplete shapes, and fences under trees in Figure 9 were also correctly classified, although they were partially occluded. For the sparse ALS point cloud, the proposed method also produced a satisfactory classification result. The predicted classification results without global label optimization are shown in the second column of Figure 8. The comparison shows that even though the CRF did not dramatically improve the classification accuracy, it efficiently suppressed the influence of noise and guaranteed the smoothness of the classification result, which is beneficial for further object detection, reconstruction, etc.
However, there were some situations that still were difficult to classify. These situations involved uncommon structures (e.g., the gatehouse and flower bed in region A, and the spheroidal roof in region D), and insufficient sampling (e.g., glass refraction in region B and the distant scanning of cars in region C). As shown in Figure 10, they were mistakenly classified due to the lack of sufficient training samples or scanning coverage.

Assessment
Three metrics were used to quantitatively evaluate the performance of the proposed method. Precision is defined as the percentage of correctly classified points in the classification results, and is sensitive to the number of spurious points. Recall is defined as the percentage of correctly classified reference points, and is sensitive to the number of missed points. To give a global assessment, the accuracy, which is the percentage of reference point cloud labels that are correctly predicted, was also considered in this paper. The three metrics are defined as:

$$precision = \frac{TP}{TP + FP}, \quad recall = \frac{TP}{TP + FN}, \quad accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \tag{10}$$

where TP is the number of true positives (i.e., points found both in the reference and in the classification), FP is the number of false positives (i.e., classified points not found in the reference), FN is the number of false negatives (i.e., reference points not found in the classification), and TN is the number of true negatives (i.e., points found neither in the reference nor in the classification). Similar to Li et al. [11], precision/recall were used to represent the classification quality in this paper.

Quantitative evaluations of the experimental results are provided in Table 2. The classification accuracies of the MLS point clouds of blocks I and II were almost identical, at 83.18% and 82.98%, respectively, while the classification accuracy of the ALS point cloud of block III was relatively higher, at 94.06%, due to its sparser point density and simpler object types (i.e., vegetation, buildings, and cars). Additionally, compared with small-size objects (e.g., cars and lamps), large-size objects (e.g., vegetation and buildings) were easier to classify. Incompleteness caused by occlusion did not influence the classification of the large-size objects, whereas the shapes of the small-size objects were easily obscured by occlusion. Moreover, Table 2 shows that the recalls of vegetation and cars were higher than their precisions. Since building walls and lamps were easily obscured by the surrounding trees, the precisions of buildings and lamps were higher than their recalls.

To evaluate the advantage of multi-scale voxelization over single-scale voxelization, three single-scale voxelizations (finest, middle, and coarsest scales), with neighboring sizes of 0.385 m, 1.54 m, and 6.16 m respectively, were compared and tested using the point cloud of block I. The quantitative assessments of the experimental results are shown in Table 3. It can be seen that different voxelizations had different classification successes for objects of different sizes. With the finest voxelization, small-size objects, such as lamps, were classified satisfactorily, while it was difficult to distinguish buildings from vegetation. Coarser voxelization was more appropriate for extracting the distinctive features of large-size objects. Comparing Table 3 with Table 2, we conclude that multi-scale voxelization adaptively characterized and classified all of the objects better, regardless of their types and sizes.
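As a quick check of the metric definitions at the start of this subsection (Equation (10)), the following minimal Python sketch computes them from per-class confusion counts.

```python
# The three assessment metrics, computed from confusion counts of one class.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)
```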

Discussion
To further determine the performance of the proposed MSNet, comparison experiments with other state-of-the-art methods and a generalization capability analysis were conducted. To accomplish these experiments, another two point cloud data sets, terrestrial laser scanning [11,15] (TLS-Wang) and ALS [21] (ALS-Zhang), were utilized. The TLS-Wang point cloud was obtained with a single terrestrial scanner, and the majority of its objects were buildings, trees, cars, and pedestrians. The ALS-Zhang point cloud was acquired by a Leica ALS50 system with a mean flying height of 500 m above ground, and contained three kinds of objects (vegetation, buildings, and cars). The details of the TLS-Wang and ALS-Zhang data sets can be found in Wang et al. [15] and Zhang et al. [21], respectively. Each data set included two scenes (i.e., scene I and scene II), as shown in Figure 11. Scene I was used for the comparison experiments, and scene II was used for the generalization capability analysis. Additionally, their ground points were removed prior to the analysis.

Comparison with Other Methods
Considering the similarity of point cloud density and object size, the parameter settings for the TLS-Wang and ALS-Zhang point clouds in this section were the same as for the experiments on the MLS (MLS-WHU) and ALS (ALS-WHU) point clouds, respectively; the number and sizes of the multi-scale voxelizations were also the same. Additionally, there were only a few training samples in the data sets, which were insufficient for network training. To enrich the diversity of the samples, we randomly selected 50,000 samples, which accounted for 25% of the total amount, and rotated them by two arbitrary angles around a vertical axis. Together with the original samples in Wang et al. [15] and Zhang et al. [21], a total of 400,000 training samples were ultimately used for both the TLS-Wang and ALS-Zhang point clouds. The classification results of scene I are shown in Figure 12, and the corresponding ground truths are provided in Figure 11a,c. It can be seen that the proposed method successfully classified most of the objects in the TLS-Wang and ALS-Zhang point clouds. For the TLS-Wang point cloud classification, we compared the proposed method with other state-of-the-art methods (the sLDA model [46], the LDA model [15], an object-oriented decision tree [11], and PointNet++ [41]). The precision/recall of each kind of object and the overall accuracy are listed in Table 4. In general, the proposed method achieved the highest accuracy at 98.24%, which was far higher than the other methods. Moreover, its classification accuracy was also higher than in the experiments on the WHU data, as the TLS-Wang scene was relatively simpler. Similar to the experiments on the WHU point cloud, large-size objects (vegetation and buildings) were relatively easier to classify, as they were less sensitive to partial occlusion and incompleteness.
For the ALS-Zhang point cloud classification, several other methods, including Guo et al. [47], Zhang et al. [21], and PointNet++ [41], were compared with the proposed method. Table 5 shows the precision/recall for each kind of object and the overall accuracy. It is noteworthy that our proposed method not only worked well on a TLS point cloud, but also achieved the highest accuracy for ALS point cloud classification. The classification accuracy of our proposed method was 97.02%, which was far higher than the results achieved by the other methods. For all of the compared methods, cars were the most difficult to classify, due to discretization errors caused by insufficient sampling. Moreover, some piecemeal and low vegetation was mistakenly classified, because its shape, viewed from overhead, was easily confused with that of cars and buildings.

Generalization Capability Analysis
This section focuses on the generalization capability of the proposed MSNet. In our previous experiments, it was necessary to collect training samples before the classification step. However, manually collecting enough samples for each classification task would be cumbersome and unacceptable. Therefore, the cross-scene generalization capability of the proposed MSNet, which measures the applicability of an MSNet pre-trained on one scene to other scenes of point clouds, is also an important aspect of network assessment.
For the TLS point cloud, we tested the TLS-Wang scene I with two different MSNets, which were trained with the MLS-WHU and TLS-Wang scene II point clouds, respectively. The classification results are provided in Figure 13a,b. Although there were some incorrect classifications (marked in black circles), both achieved satisfactory classification results. The incorrect classifications were attributed to the lack of similar samples in the training step.
For the ALS point cloud, we tested the ALS-Zhang scene I with two different MSNets that were trained with the ALS-WHU and ALS-Zhang scene II point clouds, respectively. The classification results are shown in Figure 14a,b. Some of the cars were mistakenly classified, because the point densities of the WHU and Zhang et al. [21] data were different (the ALS-Zhang point cloud was about three times denser than the WHU point cloud). Besides the density, the types of vegetation also varied between the experimental scenes, which led to classification errors on unknown vegetation types.
Besides the cross-scene tests on the same kind of point cloud, the ALS-Zhang scene I point cloud was also tested with the MSNet that was trained with the MLS-WHU data. In this case, the training and testing point clouds had totally different densities, perspectives, objects, and occlusions. As shown in Figure 14c, most of the buildings, which were relatively large in size, were correctly identified, while some small cars and piecemeal vegetation were not. Therefore, we concluded that the discriminative features learned by the MSNet were sufficiently robust.

Table 6 lists the quantitative precision/recall and accuracy of the aforementioned three tests. Generally, for similar scenes with equivalent densities, the pre-trained MSNet performed well. The accuracy of each of the tests was higher than 83%, and the best accuracy, achieved on data of similar density, was 92.74%. The accuracy for objects that were small in size, such as cars and pedestrians, was relatively lower, as small objects are more sensitive to noise and occlusions in point clouds. Due to the multi-scale voxelization and the weight sharing across different scales, similar context features are learned at different scales, which benefits the generalization capability of the proposed method in classifying point clouds of different resolutions, such as MLS and ALS. However, for different resolutions, the sizes of the voxels should be set according to the sizes of the objects to be classified in the point cloud.

Conclusions
The method proposed in this paper provides an efficient point cloud classification approach, which consists of two complementary parts. In the first part, the point cloud is represented as multi-scale 3D voxels, and a corresponding MSNet is proposed to learn the multi-scale discriminative local features and predict the class label of each point. In the second part, the coarse classification results of MSNet are globally optimized using CRF with a spatial consistency constraint on the point cloud.
Compared with the existing point cloud feature extraction methods, which mainly focus on designing and extracting features subjectively, the feature extraction in our method is adaptive and learning-based. With the proposed multi-scale voxelization of MSNet, the multi-scale discriminative feature of a point cloud is adaptively extracted and fused to comprehensively characterize the local spatial context of each point in a concise way.
To address the classification inconsistency of MSNet within a single object cluster, which is caused by its point-wise class prediction, a CRF with spatial consistency is constructed on the graph of the point cloud to achieve a global optimization of all of the predicted class labels.
The experimental results show that the proposed method not only works well for MLS point clouds, but also achieves much higher classification accuracies on ALS and TLS point clouds than state-of-the-art methods, at 97.02% and 98.24%, respectively, thereby demonstrating the strong generalization capability of the proposed network for point cloud classification under complex urban environments.
However, the proposed solution also has its limitations. Although the multi-scale voxelization of point clouds substantially reduces the computational expense compared with a traditional CNN, further improvement of the point-wise classification method is possible. Therefore, a new convolution kernel with angle parameters, which can adapt to the manifold structure and handle the point cloud efficiently with linear computational cost, will be considered in our future work. Additional experiments on larger data sets are also a possibility in the future.